Introduction
3.1 Deleting unnecessary columns
3.2 Renaming columns
3.3 Implementation of new feature (target variable) - wine_type
3.4 Concatenation of datasets into 1 dataset, plus Pandas Profiling and Sweetviz reports
3.5 Enumerative columns
3.6 Data types
3.7 Duplicates
3.8 Missing values
3.9 Outliers - Boxplot, Isolation Forest, Hampel
Boxplot
Isolation Forest
Hampel
3.10 Balance of target variable (wine_type)
3.11 Analysis of distribution of explanatory variables - in 5 ways
Histograms
Kolmogorov-Smirnov test
Shapiro-Wilk test
normal test from Scipy
kurtosis and skew
mean and median
6.1 CORR - Correlation - Pearson / Spearman
Correlation between target and independent variables
Correlation between independent variables
6.2 Variance Inflation Factor (VIF)
6.3 Information Value (IV)
6.4 Sequential feature selection - Forward / Backward
Forward selection
Backward selection
6.5 TREE
6.6 Recursive Feature Elimination (RFE)
6.7 Summary of features selection
7 Oversampling - SMOTE
8 Construction of functions for accelerating work
8.1 Confusion matrix
8.2 Classification report
8.3 Comparison of statistics from train and test datasets
8.4 ROC curve for train and test dataset
9 Machine Learning Models
9.1 Logistic Regression
Model building with tuning of hyperparameters and tuning of train test split
Model evaluation - confusion matrix, classification report, comparison of statistics on train and test datasets (Accuracy, Precision, Recall, F1, AUC, Gini), ROC, PROFIT, LIFT
Feature importance
Results and save results to Excel file
9.2 KNN
Model building with tuning of hyperparameters and tuning of train test split
Model evaluation - confusion matrix, classification report, comparison of statistics on train and test datasets (Accuracy, Precision, Recall, F1, AUC, Gini), ROC, PROFIT, LIFT
Results and save results to Excel file
9.3 SVM
Model building with tuning of hyperparameters and tuning of train test split
Model evaluation - confusion matrix, classification report, comparison of statistics on train and test datasets (Accuracy, Precision, Recall, F1, AUC, Gini), ROC, PROFIT, LIFT
Results and save results to Excel file
9.4 Naive Bayes
Model building with tuning of hyperparameters and tuning of train test split
Model evaluation - confusion matrix, classification report, comparison of statistics on train and test datasets (Accuracy, Precision, Recall, F1, AUC, Gini), ROC, PROFIT, LIFT
Results and save results to Excel file
9.5 Decision Tree
Model building with tuning of hyperparameters and tuning of train test split
Model evaluation - confusion matrix, classification report, comparison of statistics on train and test datasets (Accuracy, Precision, Recall, F1, AUC, Gini), ROC, PROFIT, LIFT
Results and save results to Excel file
9.6 Random Forest
Model building with tuning of hyperparameters and tuning of train test split
Model evaluation - confusion matrix, classification report, comparison of statistics on train and test datasets (Accuracy, Precision, Recall, F1, AUC, Gini), ROC, PROFIT, LIFT
Results and save results to Excel file
9.7 XGBoost
Model building with tuning of hyperparameters and tuning of train test split
Model evaluation - confusion matrix, classification report, comparison of statistics on train and test datasets (Accuracy, Precision, Recall, F1, AUC, Gini), ROC, PROFIT, LIFT
Results and save results to Excel file
10 Comparison of statistics of models and ROC plot of all models
10.1 Comparison of statistics of models
10.2 Comparison of models on ROC curve with AUC
11 Conclusions
Target variable:
1 - red wine
0 - white wine
Problem description:
Build, evaluate and compare classification models to choose the best model to predict type of wine.
Programming language:
Python
Libraries:
Scikit-learn, SciPy, Statsmodels, Pandas, NumPy, Matplotlib, Seaborn, Scikitplot, yellowbrick, xgboost
Algorithms:
Isolation Forest, Hampel, Kolmogorov-Smirnov, Shapiro-Wilk, normal test from SciPy, dummy coding, Pearson / Spearman corr, VIF, IV, Forward / Backward, TREE, RFE
Models built:
Logistic Regression, KNN, SVM, Naive Bayes, Decision Tree, Random Forest, XGBoost
Methods of model evaluation:
Confusion matrix, classification report, Accuracy, Precision, Recall, F1, AUC, Gini, ROC, PROFIT, LIFT, comparison of statistics on train and test datasets
WARNING!
The modelling dataset (the input dataset after modifications) is really small: it contains only 7922 observations and 13 variables, including the target variable (wine_type). As a result, the models may be overfitted, because regardless of the algorithms chosen, the hyperparameter tuning or the data engineering techniques implemented, a sufficiently large dataset is really important for models, and good quality data matters more than the algorithms themselves.
#Libraries
import pandas as pd
import pandas_profiling
import numpy as np
import sweetviz as sv
import seaborn as sns
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
#from pylab import rcParams
import matplotlib.patches as mpatches
from pylab import *
import scikitplot as skplt
import datetime
import os
import statsmodels.api as sm
from scipy.stats import norm
import pylab
import statsmodels.stats
from sklearn.ensemble import IsolationForest
from hampel import hampel
from scipy.stats import kstest
from scipy.stats import shapiro
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from functools import reduce
from yellowbrick.classifier import ClassificationReport
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from helpers import *
from sklearn import neighbors
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import plot_importance
from matplotlib import pyplot
import joblib
from pandas_profiling import ProfileReport
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
#Turn off scientific notation
pd.set_option("display.float_format", lambda x: "%.3f" % x)
#Set style of plots
plt.style.use("ggplot")
#Version of Python and libraries used
from platform import python_version
import matplotlib
import sklearn
print("Python version is {}".format(python_version()))
print("Pandas version is {}".format(pd.__version__))
print("Scipy version is {}".format(scipy.__version__))
print("Scikit-learn is {}".format(sklearn.__version__))
print("Statsmodels is {}".format(sm.__version__))
print("Numpy version is {}".format(np.__version__))
print("Matplotlib version is {}".format(matplotlib.__version__))
print("Seaborn version is {}".format(sns.__version__))
print("XGBoost version is {}".format(xgb.__version__))
#To calculate time of full code compilation
start = datetime.datetime.now()
#Reading of datasets
red_wines_features = pd.read_csv("winequality-red.csv", header = 0, sep = ";")
white_wines_features = pd.read_csv("winequality-white.csv", header = 0, sep = ";")
#Displaying the beginning and the end of the red_wines_features dataset
display(red_wines_features.head(3))
display(red_wines_features.tail(3))
#Displaying the beginning and the end of the white_wines_features dataset
display(white_wines_features.head(3))
display(white_wines_features.tail(3))
#Shapes of dataset
print("Red wines:", red_wines_features.shape)
print("White wines:", white_wines_features.shape)
From a business perspective, all variables look necessary.
#Changing of column names in the red_wines_features dataset
new_columns_names_red = {"fixed acidity" : "fixed_acidity",
"volatile acidity" : "volatile_acidity",
"citric acid" : "citric_acid",
"residual sugar" : "residual_sugar",
"free sulfur dioxide" : "free_sulfur_dioxide",
"total sulfur dioxide" : "total_sulfur_dioxide"}
red_wines_features.rename(columns = new_columns_names_red, inplace= True)
red_wines_features.sample()
#Changing of column names in the white_wines_features dataset
new_columns_names_white = {"fixed acidity" : "fixed_acidity",
"volatile acidity" : "volatile_acidity",
"citric acid" : "citric_acid",
"residual sugar" : "residual_sugar",
"free sulfur dioxide" : "free_sulfur_dioxide",
"total sulfur dioxide" : "total_sulfur_dioxide"}
white_wines_features.rename(columns = new_columns_names_white, inplace= True)
white_wines_features.sample()
#Implementation of target variable (wine type) into both datasets
red_wines_features["wine_type"] = 1
white_wines_features["wine_type"] = 0
#Sample of dataset after modification
display(red_wines_features.sample())
display(white_wines_features.sample())
#Concatenation of 2 datasets: red_wines_features and white_wines_features into one dataset with target variable: "wine_type"
data = pd.concat([red_wines_features, white_wines_features], axis = 0)
data.head(5)
#Shape of full dataset
print("Full dataset:", data.shape)
#Short dataset description
print("The dataset contains {} observations as well as {} variables.".format(data.shape[0], data.shape[1]))
profile = ProfileReport(data)
profile.to_file("Pandas_Profiling_Report.html")
sweet_report = sv.analyze(data)
sweet_report.show_html('Sweetviz.html')
#Creation new variable: high_quality_with_sugar - high quality wine with sugar
def high_quality_with_sugar(x):
if x["residual_sugar"] > 2.5 and x["quality"] > 5:
return 1
else:
return 0
data["high_quality_with_sugar"] = data.apply(lambda x: high_quality_with_sugar(x), axis = 1)
data["high_quality_with_sugar"] = data["high_quality_with_sugar"].astype("int64")
#Creation new variable: high_quality_without_sugar - high quality wine without sugar
def high_quality_without_sugar(x):
if x["residual_sugar"] <= 2.5 and x["quality"] > 5.0:
return 1
else:
return 0
data["high_quality_without_sugar"] = data.apply(lambda x: high_quality_without_sugar(x), axis = 1)
data["high_quality_without_sugar"] = data["high_quality_without_sugar"].astype("int64")
#Creation new variable: low_quality_with_sugar - low quality wine with sugar
def low_quality_with_sugar(x):
if x["residual_sugar"] > 2.5 and x["quality"] <= 5.0:
return 1
else:
return 0
data["low_quality_with_sugar"] = data.apply(lambda x: low_quality_with_sugar(x), axis = 1)
data["low_quality_with_sugar"] = data["low_quality_with_sugar"].astype("int64")
#Creation new variable: low_quality_without_sugar - low quality wine without sugar
def low_quality_without_sugar(x):
if x["residual_sugar"] <= 2.5 and x["quality"] <= 5.0:
return 1
else:
return 0
data["low_quality_without_sugar"] = data.apply(lambda x: low_quality_without_sugar(x), axis = 1)
data["low_quality_without_sugar"] = data["low_quality_without_sugar"].astype("int64")
#Creation new variable: alcohol_sugar
def alcohol_sugar(x):
"""
Creation of varaible: "alcohol_sugar" present level of residual_sugar and alcohol in wine.
"""
if x["residual_sugar"] > 10.0 and x["alcohol"] > 11.5:
return "high_alco_sweet"
elif (x["residual_sugar"] > 5.0 and x["residual_sugar"] <= 10.0) and (x["alcohol"] >10.4 and x["alcohol"] <= 11.5):
return "medium_alco_sweet"
elif (x["residual_sugar"] > 3.2 and x["residual_sugar"] <= 5.0) and (x["alcohol"] >10.0 and x["alcohol"] <= 10.4):
return "low_alco_sweet"
elif x["residual_sugar"] <= 3.2 and x["alcohol"] <= 10.0:
return "minimum_alco_sweet"
data["alcohol_sugar"] = data.apply(lambda x: alcohol_sugar(x), axis = 1)
data["alcohol_sugar"] = data["alcohol_sugar"].astype("object")
#Creation new variable: sulphates_level
def sulphates_level(x):
"""
Creation of varaible: "sulphates_level" present level of sulphates.
"""
if x["sulphates"] > 0.7:
return "high"
elif x["sulphates"] > 0.5 and x["sulphates"] <= 0.7:
return "medium"
elif x["sulphates"] > 0.3 and x["sulphates"] <= 0.5:
return "low"
elif x["sulphates"] <= 0.3:
return "minimum"
data["sulphates_level"] = data.apply(lambda x: sulphates_level(x), axis = 1)
data["sulphates_level"] = data["sulphates_level"].astype("object")
#Shape of full dataset with enumerative variables
print("Full dataset:", data.shape)
#Short dataset description with enumerative variables
print("The dataset with enumerative variables contains {} observations as well as {} variables.".format(data.shape[0], data.shape[1]))
#Information about types of data
data.info()
#Lists of columns based on data type
print("Numeric variables:", data.select_dtypes(include=["int64", "float64"]).columns.tolist())
print("Categorical variables:", data.select_dtypes(include=["object"]).columns.tolist())
print("Datetime variables:", data.select_dtypes(include=["datetime64"]).columns.tolist())
#Search of duplicated observations
data[data.duplicated(data.columns.tolist(), keep=False)]
#Drop of duplicated observations
pd.DataFrame.drop_duplicates(data, inplace=True)
#Shape of data after drop of duplicated rows
data.shape
#Ensuring that deletion of duplicates has worked correctly
data[data.duplicated(data.columns.tolist(), keep=False)]
#Restore index order
data.reset_index(inplace=True)
#Detection of missing values
summary = pd.DataFrame(data.dtypes, columns=['Feature type'])
summary["Is_Null"] = pd.DataFrame(data.isnull().any())
summary["Sum_Null"] = pd.DataFrame(data.isnull().sum())
summary["Is_NaN"] = pd.DataFrame(data.isna().any())
summary["Sum_NaN"] = pd.DataFrame(data.isna().sum())
summary["Null_perc"] = round((data.apply(pd.isnull).mean()*100),2)
summary["NaN_perc"] = round((data.apply(pd.isna).mean()*100),2)
summary
#Checking of missing values in the dataset
print("Null values:",data.isnull().sum().sum())
print("NaN values:",data.isna().sum().sum())
#Heatmap of missing values
#Size of the plot
plt.figure(figsize=(20,5))
#Creation of the heatmap
sns.heatmap(data.isnull(),
yticklabels=False,
cbar=False,
cmap="bone").set_title("Heatmap of missing values in the dataset",
fontsize = 25,
color = "darkblue")
#X-axis descriptions in terms of rotation and size of caption
plt.xticks(rotation=45, fontsize=18, horizontalalignment="right", color="darkblue")
plt.show()
#Drop of alcohol_sugar column, because > 75% of NaN is definitely too much
data.drop(columns=["alcohol_sugar"], inplace=True)
#Numeric columns from dataset - for boxplots
data_num_col = data.select_dtypes(include=["int64", "float64"])
data_num_col.drop(columns=["index"], inplace=True)
#Numeric columns from dataset without target- for Isolation Forest and Hampel
num_col = [x for x in data_num_col.columns.tolist() if x != "wine_type"]
#Boxplots
for column in data_num_col:
ax = plt.figure(figsize=(8,5))
data_num_col.boxplot([column])
#Boxplots on 1 plot
rcParams["figure.figsize"] = 18,5
data_num_col.plot(kind="box")
plt.title("Boxplots with outliers", fontsize=15)
plt.xlabel("Variable", fontsize = 12)
plt.xticks(fontsize=10)
plt.show()
#Summary of outliers detection by boxplot
print("Based on boxplots there are outliers in following variables:")
print(
"- fixed_acidity""\n"
"- volatile_acidity""\n"
"- citric_acid""\n"
"- residual_sugar""\n"
"- chlorides""\n"
"- free_sulfur_dioxide""\n"
"- total_sulfur_dioxide""\n"
"- density""\n"
"- pH""\n"
"- sulphates""\n"
"- alcohol""\n"
"- quality""\n")
#Isolation Forest model to find observations with outliers
model = IsolationForest(n_estimators=50, max_samples='auto', max_features=1.0)
#Training of IF model on numeric columns of dataset
model.fit(data[num_col])
#Adding scores and anomaly columns which show observations with outliers in dataset
data["scores"] = model.decision_function(data[num_col])
data["anomaly"] = model.predict(data[num_col])
data
#DataFrame with outliers from dataset
outliers = data.loc[data["anomaly"]== - 1]
#Indexes of observations with outliers
anomaly_index = outliers.index.tolist()
#Observations with outliers
outliers
#Printing amount of observations with outliers and indexes of observations with outliers
print("Dataset contains:",outliers.shape[0], "observations with outliers.")
print("Outliers are in following indexes:", anomaly_index)
Removing all observations (the indexes above) containing outliers would result in the loss of a lot of information, given the number of rows that would need to be removed. Thus, it is more reasonable to handle the outliers with the Hampel method.
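For reference, the core of the Hampel filter is a rolling median plus a MAD-based threshold: values outside median ± n·MAD are replaced with the local median rather than dropped. A self-contained sketch of that logic (an illustration only, not the `hampel` package implementation used in this notebook):

```python
import numpy as np
import pandas as pd

def hampel_sketch(series, window_size=2, n=3):
    # Centered rolling window of 2*window_size + 1 points
    k = 1.4826  # scales MAD to be consistent with the std of a normal distribution
    rolling = series.rolling(window=2 * window_size + 1, center=True, min_periods=1)
    med = rolling.median()
    mad = k * rolling.apply(lambda w: np.median(np.abs(w - np.median(w))), raw=True)
    outlier = (series - med).abs() > n * mad
    # Replace flagged points with the local median instead of dropping rows
    return series.where(~outlier, med)

s = pd.Series([1.0, 1.1, 0.9, 50.0, 1.0, 1.2, 0.8])
filtered = hampel_sketch(s)
print(filtered)  # the 50.0 spike is replaced by its local median
```

Because only the flagged values are overwritten, the number of observations stays unchanged, which is exactly why this method is preferable on a small dataset.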
#Hampel method used on variables where are outliers based on boxplots
hampel_columns = ["fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
                  "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density",
                  "pH", "sulphates", "alcohol", "quality"]
hampel_frames = [hampel(data[column], window_size=2, n=3).to_frame() for column in hampel_columns]
#Dataset with index and target variable - dataset to merge with variables after Hampel method by index
data = data[["wine_type", "high_quality_with_sugar", "high_quality_without_sugar", "low_quality_with_sugar",
             "low_quality_without_sugar", "sulphates_level"]]
data
#Merge of dataset with target and variables after Hampel method by index
data = reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True),
              [data] + hampel_frames)
#Dataset after using Hampel method to change outliers values
data
#Distribution of wine_type (target variable) in percent in the dataset
default_percent_dist = data["wine_type"].value_counts(normalize=True).round(3).to_frame()*100
default_percent_dist["wine_type distribution"] = data["wine_type"].value_counts().round(3).to_frame()
default_percent_dist.columns=["wine_type distribution in %", "wine_type distribution"]
default_percent_dist
#Distribution of target variable
plt.figure(figsize=(18, 5))
ax=sns.countplot(x="wine_type",
data=data,
palette = ["grey", "red"])
ax.set_title("Distribution of target variable - wine_type", fontsize=20)
plt.xlabel("wine type",fontsize=15)
plt.ylabel("count", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
ax.set(ylim=(0, 5000))
for p in ax.patches:
ax.annotate(f'\n{p.get_height()}',
(p.get_x()+0.2,
p.get_height()),
ha="center",
color="black",
size=18)
#Settings of legend
darkcyan_patch = mpatches.Patch(color="grey", label= "white wine")
darkslategray_patch = mpatches.Patch(color="red", label= "red wine")
plt.legend(handles=[darkcyan_patch,
darkslategray_patch],
loc="best",
prop={"size": 15},
title="bank term deposit",
title_fontsize="15",
frameon=True)
plt.show()
H0 = the variable comes from a normal distribution \ H1 = the variable does not come from a normal distribution
Adopted significance level alpha = 0.05
No variable comes from a normal distribution.
#Columns to histogram (all without target variable)
to_hist = data.loc[:, data.columns != "wine_type"]
#Plot histograms
to_hist.hist(bins=50, figsize=(20,15), color="green")
plt.show()
Each variable has a p-value < 0.05, so H0 must be rejected in favor of H1; no variable is normally distributed.
#Kolmogorov-Smirnov normal distribution test
print("Kolmogorov-Smirnov test:")
print("")
print(kstest(data[["fixed_acidity"]], "norm"))
print(kstest(data[["volatile_acidity"]], "norm"))
print(kstest(data[["citric_acid"]], "norm"))
print(kstest(data[["residual_sugar"]], "norm"))
print(kstest(data[["chlorides"]], "norm"))
print(kstest(data[["free_sulfur_dioxide"]], "norm"))
print(kstest(data[["total_sulfur_dioxide"]], "norm"))
print(kstest(data[["density"]], "norm"))
print(kstest(data[["pH"]], "norm"))
print(kstest(data[["sulphates"]], "norm"))
print(kstest(data[["alcohol"]], "norm"))
print(kstest(data[["quality"]], "norm"))
Each variable has a p-value < 0.05, so H0 must be rejected in favor of H1; no variable is normally distributed.
#Shapiro-Wilk normal distribution test
print("Shapiro-Wilk test:")
print("")
print(shapiro(data[["fixed_acidity"]]))
print(shapiro(data[["volatile_acidity"]]))
print(shapiro(data[["citric_acid"]]))
print(shapiro(data[["residual_sugar"]]))
print(shapiro(data[["chlorides"]]))
print(shapiro(data[["free_sulfur_dioxide"]]))
print(shapiro(data[["total_sulfur_dioxide"]]))
print(shapiro(data[["density"]]))
print(shapiro(data[["pH"]]))
print(shapiro(data[["sulphates"]]))
print(shapiro(data[["alcohol"]]))
print(shapiro(data[["quality"]]))
The loop confirms that each variable has a p-value < 0.05, so H0 must be rejected in favor of H1; no variable is normally distributed.
#Shapiro-Wilk normal distribution test in loop
print("Shapiro-Wilk test:")
print("")
data_numerical = data[list(data.select_dtypes(include=["int64", "float64"]))]
results = []
for feature in data_numerical.columns:
alpha = 0.05
p_value = shapiro(data[feature])[1]
results.append([feature, p_value])
if(p_value < alpha):
print("For variable \"" + feature +
"\" I reject the null hypothesis.\n The variable DOES NOT HAVE normal distribution. P-value:", p_value)
else:
print("For variable \"" + feature +
"\" no grounds for rejecting the null hypothesis have been detected. The variable HAS normal distribution. P-value:", p_value)
Each variable has a p-value < 0.05, so H0 must be rejected in favor of H1; no variable is normally distributed.
#Select numerical variables from the dataset
data_numerical = data[list(data.select_dtypes(include=["int64", "float64"]))]
#Verification of hypotheses
results = []
for feature in data_numerical.columns:
alpha = 0.05
p_value = scipy.stats.normaltest(data_numerical[feature])[1]
results.append([feature, p_value])
if(p_value < alpha):
print("For variable \"" + feature +
"\" I reject the null hypothesis.\n The variable DOES NOT HAVE normal distribution. P-value:", p_value)
else:
print("For variable \"" + feature +
"\" no grounds for rejecting the null hypothesis have been detected. The variable HAS normal distribution. P-value:", p_value)
For a normal distribution, both kurtosis and skew should be close to 0.
Kurtosis is a measure of the flatness of a distribution. It tells how flattened (negative kurtosis values) or peaked (positive kurtosis values) the distribution of the examined variable is relative to the normal distribution.
Interpretation of skew: positive skew indicates a longer right tail, negative skew a longer left tail.
#Calculation of kurtosis and skew for each numerical variables by aggregation method
data_numerical.agg(["kurtosis", "skew"]).T
According to the assumptions of the normal distribution, the mean and the median should be very close (preferably equal) to each other; otherwise, there is reason to suspect the variable's distribution of a lack of normality.
Differences in the values show that the distribution is skewed:
#Calculation of mean and median for each numerical variables by aggregation method
data_numerical.agg(["mean", "median"]).T
#Scatter plots of variables to compare alcohol strength with other features of wines
features_vs_alcohol = ["fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
                       "free_sulfur_dioxide", "total_sulfur_dioxide", "density", "pH", "sulphates"]
plt.figure(figsize=(15, 12))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=1.0)
for i, feature in enumerate(features_vs_alcohol, start=1):
    subplot(5, 2, i)
    sns.scatterplot(data=data, x=feature, y="alcohol", hue="wine_type", legend="full")
    plt.title(feature + " vs % of Alcohol \n", fontsize=13)
    plt.xlabel(feature, fontsize=13)
    plt.ylabel("% of Alcohol", fontsize=13)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.legend(loc="best", prop={"size": 10}, title="wine_type", title_fontsize="10", frameon=True)
plt.show()
#Categorical columns without target variable
categorical_columns = [x for x in data.select_dtypes(include=["object"]).columns.tolist() if x != "wine_type"]
categorical_columns
#Unique values of categorical variable - levels of variable
print(data["sulphates_level"].unique())
#Dummy coding of "sulphates_level"
data=pd.get_dummies(data,
columns=categorical_columns,
drop_first=True) #parameter = True, so as to avoid collinearity
#Dataset after dummy coding
data
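A toy illustration of what `drop_first=True` does, on a hypothetical single-column frame: the first dummy level (alphabetically) is dropped, since its value is implied by the remaining ones.

```python
import pandas as pd

toy = pd.DataFrame({"sulphates_level": ["high", "medium", "low", "minimum"]})

full = pd.get_dummies(toy, columns=["sulphates_level"])                       # 4 dummy columns
reduced = pd.get_dummies(toy, columns=["sulphates_level"], drop_first=True)   # 3 dummy columns

print(full.columns.tolist())
print(reduced.columns.tolist())
```

A row with all three remaining dummies equal to 0 must belong to the dropped level, which is why keeping all four would introduce perfect collinearity.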
Finally, because of the small dataset, only the CORR and VIF methods were used for feature selection.
#Pearson correlation
corr_pearson = data.corr(method="pearson").abs()[["wine_type"]].sort_values(by="wine_type", ascending=False)
corr_pearson.rename(columns={"wine_type" : "wine_type_corr_pearson"}, inplace=True)
correlation_matrix = pd.DataFrame(np.abs( data.corr(method="pearson")), columns = data.columns, index = data.columns)
correlation_matrix.drop("wine_type", axis = 0, inplace = True)
correlation_matrix.reset_index(inplace=True)
plt.figure(figsize=(10,7))
sns.set(font_scale=1.4)
sns.barplot(data = correlation_matrix.sort_values('wine_type', ascending=False),
x = 'wine_type',
y = 'index',
palette = "inferno_r")
plt.title("Coefficient of PEARSON correlation between target and independent variables", fontsize=20)
plt.xlabel("correlation coefficient", fontsize=15)
plt.ylabel("variable", fontsize=15)
plt.show()
#Spearman correlation
corr_spearman = data.corr(method="spearman").abs()[["wine_type"]].sort_values(by="wine_type", ascending=False)
corr_spearman.rename(columns={"wine_type" : "wine_type_corr_spearman"}, inplace=True)
correlation_matrix = pd.DataFrame(np.abs(data.corr(method="spearman")), columns = data.columns, index = data.columns)
correlation_matrix.drop('wine_type', axis = 0, inplace = True)
correlation_matrix.reset_index(inplace=True)
plt.figure(figsize=(10,7))
sns.set(font_scale=1.4)
sns.barplot(data = correlation_matrix.sort_values('wine_type', ascending=False),
x = 'wine_type',
y = 'index',
palette = 'YlGn_r')
plt.title("Coefficient of SPEARMAN correlation between target and independent variables", fontsize=20)
plt.xlabel("correlation coefficient", fontsize=15)
plt.ylabel("variable", fontsize=15)
plt.show()
#Merge of correlations data frames
corr_merge=pd.merge(corr_pearson,corr_spearman,left_index=True,right_index=True).sort_values(by=["wine_type_corr_spearman",
"wine_type_corr_pearson"],
ascending=False)
corr_merge
#Highest corr with target - Spearman corr does not require normal distribution (in contrast to Pearson), so in this dataset
#Spearman is the preferable CORR to use
corr_threshold = 0.70
high_corr = corr_merge[corr_merge["wine_type_corr_spearman"] > corr_threshold][["wine_type_corr_spearman"]].index.tolist()
corr_to_drop = [column for column in high_corr if column != "wine_type"]
corr_to_drop
#Drop column with too high correlation with target
data.drop(columns=corr_to_drop, inplace=True)
#Heatmap of correlations - as above in this dataset Spearman corr is preferable
plt.figure (figsize = (15,7))
cor = data.corr(method="spearman").abs()
sns.heatmap (cor, annot = True, annot_kws={"size": 11}, cmap = plt.cm.Reds)
plt.title("CORR Spearman between independent variables", fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()
#Spearman correlation between independent variables, where CORR > 0.70
corr = data.corr(method="spearman").abs()
plt.figure(figsize=(15, 9))
sns.heatmap(corr[(corr >= 0.7)],
cmap='inferno', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=True, annot_kws={"size": 12}, square=True)
plt.title("CORR Spearman between independent variables CORR > 0.70", fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()
#Columns to drop after checking CORR between independent variables:
#free_sulfur_dioxide - highly correlated (0.71) with total_sulfur_dioxide, while total_sulfur_dioxide is more strongly
#correlated with the target (but not > 0.70 correlated with the target)
#sulphates_level_low - highly correlated (0.71) with sulphates, while sulphates is more strongly correlated
#with the target (but not > 0.70 correlated with the target); moreover, sulphates_level_low is also
#highly correlated with sulphates_level_medium
#Data before drop - will be used in the next cell
data_before_drop = [column for column in data.columns.tolist() if column != "wine_type"]
#Dropping high correlated variables (with other independent variables)
data.drop(columns=["free_sulfur_dioxide", "sulphates_level_low"], inplace=True)
#Columns selected by CORR
CORR_selected_features = [column for column in data if column != "wine_type"]
all_columns = data_before_drop
do_not_selected_CORR = [x for x in all_columns if x not in CORR_selected_features]
print("Features selected by CORR:", CORR_selected_features)
print("Number of selected features:", len(CORR_selected_features))
print()
print("Features not selected:", do_not_selected_CORR)
print("Number of not selected features:", len(do_not_selected_CORR))
VIF - a factor that assesses the degree of collinearity between explanatory variables in statistical models.
If the VIF is between 5 and 10, multicollinearity is likely present and dropping the variable should be considered. If VIF > 10, there is strong collinearity between the explanatory variables and the variable should be removed from the dataset.
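The relationship behind the factor is VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing feature j on all the other features. A minimal sketch of that calculation on synthetic data (not the project's dataset); note that statsmodels' `variance_inflation_factor` assumes the design matrix already contains any constant column, while this sketch adds its own intercept:

```python
import numpy as np

def vif_manual(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all remaining columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Synthetic demo: two independent columns, plus one nearly identical to the first
rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
print(vif_manual(np.column_stack([a, b])))  # both close to 1 - no collinearity
print(vif_manual(np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])))  # last VIF huge
```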
#Data Frame with column to VIF calculation
cols_to_VIF = data.loc[:, data.columns != "wine_type"]
#Option 1 - VIF calculation by a created function
#Function to calculate the VIF indicator for selected features
def calculation_VIF(X):
    """
    Builds a DataFrame with the features selected for the model (parameter X)
    and calculates the VIF indicator for each of them.
    """
    vif = pd.DataFrame()
    vif["DF_features"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["VIF"] = round(vif["VIF"], 2)
    vif = vif.sort_values(by="VIF", ascending=False).set_index("DF_features")
    return vif
#Activation of function to calculate VIF for selected features
calculation_VIF(cols_to_VIF)
#Option 2 - VIF calculation by function from statsmodels library
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["DF_features"] = cols_to_VIF.columns
#Adding column with VIF
vif_data["VIF"] = [variance_inflation_factor(cols_to_VIF.values, i)
for i in range(len(cols_to_VIF.columns))]
vif_data = vif_data.sort_values(by="VIF", ascending=False).set_index("DF_features")
vif_data
#Drop variables with too high VIF factor
data.drop(columns=vif_data[vif_data["VIF"]>10].index.tolist(), inplace=True)
#Columns selected by VIF
VIF_selected_features = vif_data[vif_data["VIF"] < 10].index.tolist()
do_not_selected_VIF = vif_data[vif_data["VIF"] > 10].index.tolist()
print("Features selected by VIF:", VIF_selected_features)
print("Number of selected features:", len(VIF_selected_features))
print()
print("Features not selected:", do_not_selected_VIF)
print("Number of not selected features:", len(do_not_selected_VIF))
Variables were selected based on the Information Value (IV), which measures the predictive power of each variable. Unfortunately, no library with a ready-made method for computing the IV indicator was found, so the function was written with the help of Stackoverflow.com.
The usual interpretation of IV is: less than 0.02 - not useful for prediction; 0.02 to 0.1 - weak predictor; 0.1 to 0.3 - medium predictor; 0.3 to 0.5 - strong predictor; larger than 0.5 - suspicious or too good to be true.
Only variables that are medium or strong predictors were selected for the model.
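As a sanity check of the formula IV = Σ (%Events − %Non-Events) × WoE, here is a hand-computable two-bin example with made-up counts:

```python
import math

# Made-up two-bin split: counts of events (e.g. red wine) and non-events (white wine)
bins = [
    {"events": 80, "non_events": 20},
    {"events": 20, "non_events": 80},
]
total_events = sum(b["events"] for b in bins)          # 100
total_non_events = sum(b["non_events"] for b in bins)  # 100

iv = 0.0
for b in bins:
    pct_events = b["events"] / total_events
    pct_non_events = b["non_events"] / total_non_events
    woe = math.log(pct_events / pct_non_events)        # WoE of the bin
    iv += (pct_events - pct_non_events) * woe          # bin contribution to IV

print(round(iv, 4))  # -> 1.6636, well above the 0.5 "suspicious" band
```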
#Function for calculation of IV and WoE
def iv_woe_calculation(data, target, bins=10):
    """
    Calculates WoE and IV for every variable in the dataset, based on the target column.
    """
    #Empty DataFrames for the results
    newDF, woeDF = pd.DataFrame(), pd.DataFrame()
    #Extract column names
    cols = data.columns
    #Run WoE and IV on all the independent variables
    for ivars in cols[~cols.isin([target])]:
        if (data[ivars].dtype.kind in "bifc") and (len(np.unique(data[ivars])) > 10):
            binned_x = pd.qcut(data[ivars], bins, duplicates="drop")
            d0 = pd.DataFrame({"x": binned_x, "y": data[target]})
        else:
            d0 = pd.DataFrame({"x": data[ivars], "y": data[target]})
        d = d0.groupby("x", as_index=False).agg({"y": ["count", "sum"]})
        d.columns = ["Cutoff", "N", "Events"]
        #0.5 is used instead of 0 events to avoid division by zero / log(0)
        d["% of Events"] = np.maximum(d["Events"], 0.5) / d["Events"].sum()
        d["Non-Events"] = d["N"] - d["Events"]
        d["% of Non-Events"] = np.maximum(d["Non-Events"], 0.5) / d["Non-Events"].sum()
        d["WoE"] = np.log(d["% of Events"] / d["% of Non-Events"])
        d["IV"] = d["WoE"] * (d["% of Events"] - d["% of Non-Events"])
        d.insert(loc=0, column="Variable", value=ivars)
        print("Information value of " + ivars + " is " + str(round(d["IV"].sum(), 6)))
        temp = pd.DataFrame({"Variable": [ivars], "IV": [d["IV"].sum()]}, columns=["Variable", "IV"])
        newDF = pd.concat([newDF, temp], axis=0)
        woeDF = pd.concat([woeDF, d], axis=0)
    return newDF, woeDF
#Activation of the function to calculate IV for each variable
iv_woe_calculation(data,"wine_type", bins=10)
#Columns selected by IV
IV_selected_features = ["high_quality_with_sugar", "low_quality_without_sugar", "sulphates_level_medium"]
all_columns = [column for column in data.columns.tolist() if column != "wine_type"]
do_not_selected_IV = [x for x in all_columns if x not in IV_selected_features]
print("Features selected by IV:", IV_selected_features)
print("Number of selected features:", len(IV_selected_features))
print()
print("Features not selected:", do_not_selected_IV)
print("Number of not selected features:", len(do_not_selected_IV))
#Train / Test split
X_forward = data.drop(labels=["wine_type"], axis=1)
y_forward = data["wine_type"]
X_train,X_test,y_train,y_test = train_test_split(X_forward, y_forward, train_size=0.6, test_size=0.4,random_state=10)
#Build RF classifier to use in feature selection
classifier = RandomForestClassifier(n_estimators=100, n_jobs=-1)
#Build step forward feature selection
forward_selection = sfs(classifier,
k_features="best",
forward=True,
floating=False,
verbose=0,
scoring='accuracy',
cv=5)
#Perform selection
forward_selection = forward_selection.fit(X_train, y_train)
##Indexes of selected variables
forward_features_selected = list(forward_selection.k_feature_idx_)
#Selected columns names
forward_selected_features = X_train.iloc[:,forward_features_selected].columns.tolist()
all_columns = [column for column in data.columns.tolist() if column != "wine_type"]
do_not_selected_forward = [x for x in all_columns if x not in forward_selected_features]
print("Features selected by FORWARD method:", forward_selected_features)
print("Number of selected features:", len(forward_selected_features))
print()
print("Features not selected:", do_not_selected_forward)
print("Number of not selected features:", len(do_not_selected_forward))
#Train / Test split
X_backward = data.drop(labels=["wine_type"], axis=1)
y_backward = data["wine_type"]
X_train,X_test,y_train,y_test = train_test_split(X_backward, y_backward, train_size=0.6, test_size=0.4,random_state=10)
#Build RF classifier to use in feature selection
classifier = RandomForestClassifier(n_estimators=100, n_jobs=-1)
#Build step backward feature selection
backward_selection = sfs(classifier,
k_features="best",
forward=False,
floating=False,
verbose=0,
scoring='accuracy',
cv=5)
#Perform selection
backward_selection = backward_selection.fit(X_train, y_train)
#Indexes of selected variables
backward_features_selected = list(backward_selection.k_feature_idx_)
#Selected columns names
backward_selected_features = X_train.iloc[:,backward_features_selected].columns.tolist()
all_columns = [column for column in data.columns.tolist() if column != "wine_type"]
do_not_selected_backward = [x for x in all_columns if x not in backward_selected_features]
print("Features selected by BACKWARD method:", backward_selected_features)
print("Number of selected features:", len(backward_selected_features))
print()
print("Features not selected:", do_not_selected_backward)
print("Number of not selected features:", len(do_not_selected_backward))
#Fit an Extra Trees model to the data
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier(criterion="gini")
model.fit(X = data.drop(labels=["wine_type"], axis=1), y = data["wine_type"])
#Display the relative importance of each attribute
print(model.feature_importances_)
#DataFrame of futures importance
importances_TREE = pd.DataFrame({"Feature":data.drop(labels=["wine_type"], axis=1).columns,
"Importance":np.round(model.feature_importances_,3)})
importances_TREE = importances_TREE.sort_values("Importance",ascending=False).set_index("Feature")
importances_TREE
#The most important features based on TREE
importances_TREE.index.tolist()
#Selected columns names
TREE_selected_features = importances_TREE.head(12).index.tolist()
all_columns = [column for column in data.columns.tolist() if column != "wine_type"]
do_not_selected_TREE = [x for x in all_columns if x not in TREE_selected_features]
print("Features selected by TREE method:", TREE_selected_features)
print("Number of selected features:", len(TREE_selected_features))
print()
print("Features not selected:", do_not_selected_TREE)
print("Number of not selected features:", len(do_not_selected_TREE))
#Create the RFE model and select 7 attributes
rfe = RFE(estimator = RandomForestClassifier(),
n_features_to_select = 7,
verbose=0)
rfe = rfe.fit(X = data.drop(labels=["wine_type"], axis=1), y = data["wine_type"])
rfe_support = rfe.get_support()
#rfe_feature = X.loc[:,rfe_support].columns.tolist()
#Summarize the selection of the attributes
#print(data.drop(labels=["wine_type"], axis=1).columns.tolist())
print(rfe.support_)
print(rfe.ranking_)
#print("Selected feature:", rfe_feature)
RFE_results = pd.DataFrame({"Variable" : data.drop(labels=["wine_type"], axis=1).columns.tolist(),
"RFE support - is it selected? [T/F]" : rfe.support_,
"RFE ranking - if 1 then selected" : rfe.ranking_}).set_index("Variable")
RFE_results
#Variables selected by RFE (7)
features_selected_RFE = RFE_results[RFE_results["RFE ranking - if 1 then selected"]==1].index.tolist()
features_selected_RFE
#Selected columns names
RFE_selected_features = features_selected_RFE
all_columns = [column for column in data.columns.tolist() if column != "wine_type"]
do_not_selected_RFE = [x for x in all_columns if x not in RFE_selected_features]
print("Features selected by RFE method:", RFE_selected_features)
print("Number of selected features:", len(RFE_selected_features))
print()
print("Features not selected:", do_not_selected_RFE)
print("Number of not selected features:", len(do_not_selected_RFE))
#Lists of features selected by different feature selection methods
CORR = CORR_selected_features
IV = IV_selected_features
FORWARD = forward_selected_features
BACKWARD = backward_selected_features
TREE = TREE_selected_features
RFE = RFE_selected_features
print("Features selected after CORR:", CORR)
print("*" * 127)
print("Features selected after IV:", IV)
print("*" * 127)
print("Features selected after FORWARD:", FORWARD)
print("*" * 127)
print("Features selected after BACKWARD:", BACKWARD)
print("*" * 127)
print("Features selected after TREE:", TREE)
print("*" * 127)
print("Features selected after RFE:", RFE)
#Common elements from each list of selected features using different methods
#Initializing list of lists
test_list = [CORR, IV, FORWARD, BACKWARD, TREE, RFE]
#Common element extraction from the list of lists with selected features
#Using reduce() + lambda + set()
res = list(reduce(lambda i, j: i & j, (set(x) for x in test_list)))
#Result
print ("The common elements from list of lists with selected feature : " + str(res))
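For reference, the same intersection can be expressed without `functools.reduce`, since `set.intersection` accepts any number of iterables directly. A minimal sketch with made-up feature lists:

```python
# Made-up feature lists from three hypothetical selection methods
CORR = ["alcohol", "sulphates", "chlorides"]
TREE = ["alcohol", "chlorides", "density"]
RFE = ["chlorides", "alcohol", "pH"]

# set.intersection accepts any number of iterables
common = set(CORR).intersection(TREE, RFE)
print(sorted(common))  # -> ['alcohol', 'chlorides']
```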
The target variable ("wine_type") is not balanced enough, so oversampling is necessary to balance the target.
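Before the library call below, it may help to see what SMOTE actually does: each synthetic sample is placed at a random point on the segment between a minority-class observation and one of its nearest minority-class neighbours. A minimal sketch of that interpolation step on two made-up points (not the full imbalanced-learn implementation, which also performs the k-nearest-neighbour search):

```python
import numpy as np

rng = np.random.default_rng(111)

# Two made-up minority-class points (a real run would pick one of the k nearest neighbours)
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 6.0])

# SMOTE interpolation: x + gap * (neighbor - x), with gap drawn from U(0, 1)
gap = rng.uniform(0.0, 1.0)
synthetic = x + gap * (neighbor - x)

print(synthetic)  # lies somewhere on the segment between x and neighbor
```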
#X and y from dataset
X = data.loc[:, data.columns != "wine_type"]
y = data.loc[:, data.columns == "wine_type"]
#SMOTE algorithm
smote = SMOTE(sampling_strategy='auto', random_state=111)
#Training of SMOTE
X_os, y_os = smote.fit_resample(X, y)
#Result of SMOTE in DF
columns = [column for column in data.columns.tolist() if column != "wine_type"]
X_os = pd.DataFrame(data=X_os, columns=columns )
y_os= pd.DataFrame(data=y_os, columns=["wine_type"])
#Shape of the dataset after oversampling with SMOTE method
print("Dataset after oversampling with SMOTE")
print("*"*40)
print("Number of observations in oversampled data: ",
len(X_os))
print("Number of white wine observations in oversampled data: ",
len(y_os[y_os['wine_type']==0]))
print("Number of red wine observations in oversampled data: ",
len(y_os[y_os['wine_type']==1]))
print("Proportion of white wine data in oversampled data: ",
len(y_os[y_os['wine_type']==0])/len(X_os))
print("Proportion of red wine data in oversampled data: ",
len(y_os[y_os['wine_type']==1])/len(X_os))
#Merge of X and y after oversampling to one dataset which is ready to modelling
data_modelling = pd.merge(y_os, X_os, left_index=True, right_index=True)
data_modelling
#Checking the distribution of target after oversampling
a = data_modelling["wine_type"].value_counts(normalize=True).round(3).to_frame()*100
a["wine_type distribution"] = data_modelling["wine_type"].value_counts().round(3).to_frame()
a.columns=["wine_type distribution in %", "wine_type distribution"]
a
#Visualization of distribution of target after oversampling
plt.figure(figsize=(18, 5))
ax=sns.countplot(x="wine_type",
data=data_modelling,
palette = ["ivory", "red"])
ax.set_title("Distribution of target variable - wine_type", fontsize=20)
plt.xlabel("wine type",fontsize=15)
plt.ylabel("count", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
ax.set(ylim=(0, 5000))
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}',
                (p.get_x()+0.2,
                 p.get_height()),
                ha="center",
                color="black",
                size=18)
#Settings of legend
darkcyan_patch = mpatches.Patch(color="ivory", label= "white wine")
darkslategray_patch = mpatches.Patch(color="red", label= "red wine")
plt.legend(handles=[darkcyan_patch,
darkslategray_patch],
loc="best",
prop={"size": 15},
title="wine type",
title_fontsize="15",
frameon=True)
plt.show()
#Saving dataset after modifications - dataset ready to modelling
data_modelling.to_csv("data_modelling.csv")
In the confusion matrix, the upper-left and lower-right quadrants contain correct classification results, while the lower-left and upper-right quadrants contain errors: the lower-left quadrant holds false negatives and the upper-right quadrant holds false positives.
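Note that in scikit-learn's `confusion_matrix` the rows are the true classes and the columns are the predicted classes, so for binary labels the layout is [[TN, FP], [FN, TP]]. A small made-up example:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: true 0s get predictions 0,1,0; true 1s get predictions 1,1,0
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # -> 2 1 1 2
```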
def conf_matrix(model_name, y_test, pred_test):
    """
    Function to print the confusion matrix together with its related statistics.
    Input:
    model_name: name of the model, for example: "XGBoost" / "KNN" / "Logistic Regression" and so on
    y_test: target variable from test dataset
    pred_test: prediction on test dataset
    Output:
    Confusion matrix with other related statistics.
    """
    print("Confusion matrix of " + model_name)
    CM = confusion_matrix(y_test, pred_test)
    print(CM)
    print("-"*40)
    TN = CM[0][0]
    FP = CM[0][1]
    FN = CM[1][0]
    TP = CM[1][1]
    sensitivity = TP/float(TP+FN)
    specificity = TN/float(TN+FP)
    print("True Negative:", TN)
    print("False Positive:", FP)
    print("False Negative:", FN)
    print("True Positive:", TP)
    print("Correct Predictions", round((TN + TP) / len(pred_test) * 100, 2), "%")
    print("-"*40)
    print("The accuracy of the model = (TP+TN)/(TP+TN+FP+FN) = ", round((TP+TN)/float(TP+TN+FP+FN), 2), "\n",
          "The Misclassification = 1-Accuracy = ", round(1-((TP+TN)/float(TP+TN+FP+FN)), 2), "\n",
          "Sensitivity or True Positive Rate = TP/(TP+FN) = ", round(TP/float(TP+FN), 2), "\n",
          "Specificity or True Negative Rate = TN/(TN+FP) = ", round(TN/float(TN+FP), 2), "\n",
          "Positive Predictive Value = TP/(TP+FP) = ", round(TP/float(TP+FP), 2), "\n",
          "Negative Predictive Value = TN/(TN+FN) = ", round(TN/float(TN+FN), 2), "\n",
          "Positive Likelihood Ratio = Sensitivity/(1-Specificity) = ", round(sensitivity/(1-specificity), 2), "\n",
          "Negative Likelihood Ratio = (1-Sensitivity)/Specificity = ", round((1-sensitivity)/specificity, 2))
def class_report(y_test, y_train, pred_test, pred_train, model_name):
    """
    Function to generate a classification report for both train and test datasets.
    Input:
    y_test: target variable from test dataset
    y_train: target variables from train dataset
    pred_test: predictions from test dataset
    pred_train: predictions from train dataset
    model_name: name of the model, for example "XGBoost", "Random Forest" and so on...
    """
    #Classification report on train and test datasets
    print("Classification report of " + model_name + " on TRAIN dataset:")
    print(classification_report(y_train, pred_train))
    print("*"*55)
    print("Classification report of " + model_name + " on TEST dataset:")
    print(classification_report(y_test, pred_test))
Comparison of results on the training and test sets. Usually, results on the training set will be better than on the test set; however, if the results on the two sets differ significantly, the model may be overfitted (a very high score on both sets can also indicate overfitting), while weak results on both the train and test datasets suggest the model may be underfitted.
def stat_comparison(y_test, y_train, X_test, X_train, pred_test, pred_train, model):
    """
    Function to generate a DF comparing different statistics on both train and test datasets.
    Input:
    y_test: target variable from test dataset
    y_train: target variables from train dataset
    X_test: independent variables from test dataset
    X_train: independent variables from train dataset
    pred_test: predictions from test dataset
    pred_train: predictions from train dataset
    model: built classifier
    Output:
    DataFrame with statistics on both train and test datasets.
    """
    #TRAIN
    accuracy_TRAIN = round(accuracy_score(y_train, pred_train), 2)
    recall_TRAIN = round(recall_score(y_train, pred_train), 2)
    precision_TRAIN = round(precision_score(y_train, pred_train), 2)
    f1_TRAIN = round(f1_score(y_train, pred_train), 2)
    y_prob_TRAIN = model.predict_proba(X_train)[::,1]
    AUC_TRAIN = metrics.roc_auc_score(y_train, y_prob_TRAIN)
    gini_TRAIN = round((2*AUC_TRAIN) - 1, 2)
    #TEST
    accuracy_TEST = round(accuracy_score(y_test, pred_test), 2)
    recall_TEST = round(recall_score(y_test, pred_test), 2)
    precision_TEST = round(precision_score(y_test, pred_test), 2)
    f1_TEST = round(f1_score(y_test, pred_test), 2)
    y_prob_TEST = model.predict_proba(X_test)[::,1]
    AUC_TEST = metrics.roc_auc_score(y_test, y_prob_TEST)
    gini_TEST = round((2*AUC_TEST) - 1, 2)
    indicators = pd.DataFrame({"Dataset" : ["TRAIN", "TEST"],
                               "Accuracy" : [accuracy_TRAIN, accuracy_TEST],
                               "Precision" : [precision_TRAIN, precision_TEST],
                               "Recall" : [recall_TRAIN, recall_TEST],
                               "F1" : [f1_TRAIN, f1_TEST],
                               "AUC" : [AUC_TRAIN, AUC_TEST],
                               "Gini" : [gini_TRAIN, gini_TEST]}).set_index("Dataset")
    print("Comparison of results on train and test dataset:")
    return indicators
def plot_roc_cur(model, X, y, df, color, model_name):
    """
    Function to plot the ROC curve with the value of the AUC metric.
    Input:
    model: created model
    X: X_train/test dataset
    y: y_train/test dataset
    df: name of dataset "train" / "test"
    color: color of ROC plot
    model_name: name of built model, for example: XGBoost / Random Forest / Logistic Regression and so on
    Output:
    ROC curve with value of AUC
    """
    probs = model.predict_proba(X)[::,1]
    auc = metrics.roc_auc_score(y, probs)
    fper, tper, thresholds = roc_curve(y, probs)
    plt.plot(fper, tper, label= model_name + " (auc = %0.3f)" % auc, color=color)
    plt.plot([0, 1], [0, 1], color='black', linestyle='--')
    plt.xlabel('False Positive Rate', fontsize=15)
    plt.ylabel('True Positive Rate', fontsize=15)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    if df == "train":
        plt.title('Receiver Operating Characteristic (ROC) Curve on TRAIN dataset', fontsize=20)
    elif df == "test":
        plt.title('Receiver Operating Characteristic (ROC) Curve on TEST dataset', fontsize=20)
    else:
        plt.title("CHECK CORRECT DATASET train / test!")
    plt.legend(loc="best",
               fontsize=15,
               prop={"size": 14},
               title="Area Under Curve (AUC)",
               title_fontsize="16",
               frameon=True)
    plt.show()
#X and y from dataset
X_LR = data_modelling.loc[:, data.columns != "wine_type"]
y_LR = data_modelling.loc[:, data.columns == "wine_type"]
#Loop to find optimal train / test split
for k in range(1, 10):
    X_train_LR, X_test_LR, y_train_LR, y_test_LR = train_test_split(X_LR,
                                                                    y_LR,
                                                                    test_size = 0.1*k,
                                                                    random_state = 2021)
    #Scaling data
    scaler = StandardScaler()
    X_train_LR = scaler.fit_transform(X_train_LR)
    X_test_LR = scaler.transform(X_test_LR)
    #Logistic Regression model
    LR = LogisticRegression()
    LR.fit(X = X_train_LR, y = y_train_LR)
    #Prediction on train dataset
    prediction_train = LR.predict(X_train_LR)
    #Prediction on test dataset
    prediction_test = LR.predict(X_test_LR)
    #Printing results
    print(f"test: {k/10}, Train AUC:", round(roc_auc_score(y_train_LR, prediction_train), 3),
          "Test AUC:", round(roc_auc_score(y_test_LR, prediction_test), 3))
The configuration test = 0.2 and train = 0.8 gives the highest AUC on the test dataset; moreover, the AUC results on the train and test datasets are similar in this configuration.
#Train / Test split
X_train_LR, X_test_LR, y_train_LR, y_test_LR = train_test_split(X_LR,
y_LR,
train_size = 0.8,
test_size = 0.2,
random_state = 1)
# #Scaling data
# scaler = StandardScaler()
# X_train_LR = scaler.fit_transform(X_train_LR)
# X_test_LR = scaler.transform(X_test_LR)
#Hyperparameter tuning - GridSearch
#Combinations of hyperparameters
grid={"C":[0.001,0.01,0.1,1,10,100,1000],
"penalty":["none","l2", "elasticnet"],# l1 lasso l2 ridge
"solver" : ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]}
#Grid parameters
grid_search_LR = GridSearchCV(estimator = LogisticRegression(),
param_grid = grid,
verbose = 0)
#Training of GridSearch
grid_search_LR.fit(X_train_LR, y_train_LR)
#Best values of hyperparameters in Logistic Regression
best_parameters_LR = grid_search_LR.best_params_
best_parameters_LR
#Creating classifier with best hyper parameters
LR = LogisticRegression(C = 1,
penalty = 'l2',
solver = 'liblinear')
#Training Logistic Regression model with best hyperparameters
LR = LR.fit(X = X_train_LR, y = y_train_LR)
#Prediction of train dataset
TRAIN_pred_LR = LR.predict(X_train_LR)
#Prediction on test dataset
TEST_pred_LR = LR.predict(X_test_LR)
#Activation of the earlier-built function to calculate the confusion matrix
conf_matrix(model_name= "Logistic Regression",
y_test = y_test_LR,
pred_test = TEST_pred_LR)
#Activation of the earlier-built function to generate the classification report
class_report(y_test = y_test_LR,
y_train = y_train_LR,
pred_test = TEST_pred_LR,
pred_train = TRAIN_pred_LR,
model_name = "Logistic Regression")
#Activation of the earlier-built function to compare statistics on train and test datasets
stat_comparison(y_test = y_test_LR,
y_train = y_train_LR,
X_test = X_test_LR,
X_train = X_train_LR,
pred_test = TEST_pred_LR,
pred_train = TRAIN_pred_LR,
model = LR)
#Activation of the earlier-built function to plot ROC curves
rcParams["figure.figsize"] = 18,5
plot_roc_cur(model = LR, X = X_train_LR, y = y_train_LR, df="train", color="blue", model_name = "Logistic Regression")
plot_roc_cur(model = LR, X = X_test_LR, y = y_test_LR, df="test", color="orange", model_name = "Logistic Regression")
#Probabilities
y_prob = LR.predict_proba(X_test_LR)
The graph shows what percentage of red and white wines the model captures within a given percentage of the scoring list. For example, in the top 10% of the list of both red and white wines with the highest score, the model captures 30% of the correct wine classifications. The model captures red and white wines with similar effectiveness.
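The cumulative-gains values plotted below (and the related lift) can be recomputed by hand: sort the observations by predicted score, and for the top q% of the list report the share of all positives captured (the gain); lift is that share divided by q. A minimal sketch with made-up labels and scores:

```python
import numpy as np

# Made-up true labels and model scores
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.2, 0.8, 0.3, 0.7, 0.1, 0.4, 0.2, 0.6, 0.5])

# Sort observations by score, descending
order = np.argsort(-scores)
sorted_true = y_true[order]

top_frac = 0.4                               # look at the top 40% of the scoring list
k = int(len(y_true) * top_frac)              # -> 4 observations
gain = sorted_true[:k].sum() / y_true.sum()  # share of all positives captured
lift = gain / top_frac                       # how much better than random targeting

print(gain, lift)  # -> 1.0 2.5 (all 4 positives sit in the top 4 scores)
```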
#PROFIT curve
skplt.metrics.plot_cumulative_gain(y_test_LR,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='PROFIT curve - Cumulative Gains Curve for Logistic Regression Model')
plt.show()
The graph illustrates how many times more effective the model is at capturing the wine type relative to classifying wines without using the model. For example, in the top 10% of the list of red and white wines with the highest score, the model correctly captures 2 times as many wines.
#LIFT curve
skplt.metrics.plot_lift_curve(y_test_LR,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='LIFT curve for Logistic Regression Model')
plt.show()
#Convert numpy array to DataFrame
array_to_df = pd.DataFrame(X_train_LR)
#Coefficient of the features
lr_importances = np.abs(LR.coef_)[0]
lr_indices = np.argsort(lr_importances)[::-1]
#Features ranking
print("Features importance:")
print()
lr_labels = []
for f in range(array_to_df.shape[1]):
    lr_labels.append(array_to_df.columns.values[lr_indices[f]])
    print(lr_labels[f], round(lr_importances[lr_indices[f]], 3))
#Bar plot of variable importance
#Convert numpy array to DataFrame
array_to_df = pd.DataFrame(X_test_LR)
plt.figure(figsize=(15,5))
plt.bar(range(array_to_df.shape[1]),
lr_importances[lr_indices],
color="b"
,align="center")
plt.title("Features importance in Logistic Regression", fontsize=17)
plt.xticks(range(array_to_df.shape[1]), lr_labels, fontsize=12)
plt.xlim([-1, array_to_df.shape[1]])
plt.show()
#Results on full dataset (probabilities of 0 and 1)
X_all = data_modelling.loc[:, data.columns != "wine_type"].values
y_all_prob = LR.predict_proba(X_all)
#Main DF and DF with probabilities
df_X_all = data_modelling
df_y_all_prob = pd.DataFrame(y_all_prob)
#Concatenation DF and probabilities
df_X_all_scored = pd.concat([df_X_all, df_y_all_prob * 100], axis = 1)
df_X_all_scored.rename(columns={0 : "Prob_0_white_wine", 1 : "Prob_1_red_wine"}, inplace=True)
df_X_all_scored
#Save columns with wine type and probabilities to Excel file
df_X_all_scored[["wine_type", "Prob_0_white_wine", "Prob_1_red_wine"]].to_excel("probabilities_results/prob_Logistic_Regression.xlsx")
#Save all dataset with probabilities to Excel file
df_X_all_scored.to_excel("probabilities_results/all_df_Logistic_Regression.xlsx")
#Names of the X and y columns from the dataset
X_KNN = data_modelling.loc[:, data.columns != "wine_type"].columns.tolist()
y_KNN = data_modelling.loc[:, data.columns == "wine_type"].columns.tolist()
#Wrapper for cross validation with 5 folds - *args and **kwargs appear at the end - lists of parameters passed as a
#dictionary or list.
def CVTestKNN(nFolds=5, randomState=2020, debug=False, *args, **kwargs):
    kf = KFold(n_splits=nFolds, shuffle=True, random_state=randomState)
    #Lists with results
    testResults = []
    trainResults = []
    predictions = []
    indices = []
    #Loop validating the model on successive folds
    for train_KNN, test_KNN in kf.split(data_modelling.index.values):
        #Preparation of the estimator
        KNN = neighbors.KNeighborsClassifier(*args, **kwargs)
        if debug:
            print(KNN)
        #Training model
        KNN.fit(data_modelling.iloc[train_KNN][X_KNN], data_modelling.iloc[train_KNN][y_KNN])
        #Preparing predictions on train and test datasets
        predictions_train = KNN.predict_proba(data_modelling.iloc[train_KNN][X_KNN])[:,1]
        predictions_test = KNN.predict_proba(data_modelling.iloc[test_KNN][X_KNN])[:,1]
        #Keep the prediction information for this fold
        predictions.append(predictions_test.tolist().copy())
        #Together with the indexes in the original data frame
        indices.append(data_modelling.iloc[test_KNN].index.tolist().copy())
        #Calculation of ROC-AUC for the fold
        trainScore = roc_auc_score((data_modelling[y_KNN].iloc[train_KNN]==1).astype(int), predictions_train)
        testScore = roc_auc_score((data_modelling[y_KNN].iloc[test_KNN]==1).astype(int), predictions_test)
        #Saving results for the fold
        trainResults.append(trainScore)
        testResults.append(testScore)
        #Optionally display information about each fold along with the training results
        if debug:
            print("Train AUC:", trainScore,
                  "Valid AUC:", testScore)
    return trainResults, testResults, predictions, indices
#Training on standard parameters
trainResults, testResults, predictions, indices = CVTestKNN(n_neighbors=5, n_jobs=-1, p=2, debug=True)
print(np.mean(trainResults), np.mean(testResults), testResults)
#Results for different n_neighbors parameter values and p=2
for k in [1, 3, 5, 10, 15, 30, 50, 100, 150, 200]:
    trainResults, testResults, predictions, indices = CVTestKNN(n_neighbors=k, n_jobs=-1, p=2)
    print("n_neighbors, mean_train, mean_test")
    print(k, np.mean(trainResults), np.mean(testResults))
#Results for different n_neighbors parameter values and p=2
for k in [5, 10, 15, 20, 25, 30, 35, 40]:
    trainResults, testResults, predictions, indices = CVTestKNN(n_neighbors=k, n_jobs=-1, p=2)
    print("n_neighbors, mean_train, mean_test")
    print(k, np.mean(trainResults), np.mean(testResults))
#Results for different n_neighbors parameter values and p=1
for k in [15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200]:
    trainResults, testResults, predictions, indices = CVTestKNN(n_neighbors=k, n_jobs=-1, p=1)
    print("n_neighbors, mean_train, mean_test")
    print(k, np.mean(trainResults), np.mean(testResults))
#Loop to find the best configuration of hyperparameters in cross validation with 5 folds
#Empty lists for hyperparameters from the loop
nn_list = list()
p_list = list()
mean_TRAIN_list = list()
mean_TEST_list = list()
#Loop over hyperparameters - because tuning a large hyperparameter grid in a loop takes an extremely long time,
#the range of hyperparameters was reduced
print("n_neighbors || p || mean_test_result || mean_train_result || train_test_difference")
print("=================================================================================")
for n_neighbors in [1, 3, 5, 10, 15, 25, 30, 35, 40, 50, 100, 150, 200]:
    for p in [1, 2]:
        trainResults, testResults, predictions, indices = CVTestKNN(debug=False,
                                                                    n_neighbors=n_neighbors,
                                                                    p=p,
                                                                    n_jobs=-1)
        #Append values from the loop to lists
        nn_list.append(n_neighbors)
        p_list.append(p)
        mean_TRAIN_list.append(np.mean(trainResults))
        mean_TEST_list.append(np.mean(testResults))
        #Display mean results for training and test sets from the 5 folds for each hyperparameter configuration
        print(n_neighbors, "||",
              p, "||",
              np.mean(testResults), "||",
              np.mean(trainResults), "||",
              (np.mean(trainResults) - np.mean(testResults)))
#Save results of hyperparameter tuning in a Data Frame
df = pd.DataFrame()
df["n_neighbors"] = nn_list
df["p"] = p_list
df["mean_TEST"] = mean_TEST_list
df["mean_TRAIN"] = mean_TRAIN_list
df["TRAIN_TEST_difference"] = df["mean_TRAIN"] - df["mean_TEST"]
As we can see, n_neighbors = 15 and p = 1 give the best results on the test dataset; moreover, the results on the test and train datasets are similar, which is also good. Nevertheless, the AUC values are very high, so overfitting can be expected, which is likely given the really small input dataset.
#The best combination of hyper parameters in KNN model based on mean results on TEST dataset
df.sort_values(by="mean_TEST", ascending=False)
#X and y from dataset
X_KNN = data_modelling.loc[:, data.columns != "wine_type"]
y_KNN = data_modelling.loc[:, data.columns == "wine_type"]
#Loop to find optimal train / test split
for k in range(1, 10):
    X_train_KNN, X_test_KNN, y_train_KNN, y_test_KNN = train_test_split(X_KNN,
                                                                        y_KNN,
                                                                        test_size = 0.1*k,
                                                                        random_state = 222)
    # #Scaling data
    # scaler = StandardScaler()
    # X_train_KNN = scaler.fit_transform(X_train_KNN)
    # X_test_KNN = scaler.transform(X_test_KNN)
    #KNN model
    KNN = neighbors.KNeighborsClassifier(n_neighbors = 15,
                                         p = 1,
                                         n_jobs=-1)
    KNN.fit(X = X_train_KNN, y = y_train_KNN)
    #Prediction on train dataset
    prediction_train_KNN = KNN.predict(X_train_KNN)
    #Prediction on test dataset
    prediction_test_KNN = KNN.predict(X_test_KNN)
    #Printing results
    print(f"test: {k/10}, Train AUC:", round(roc_auc_score(y_train_KNN, prediction_train_KNN), 3),
          "Test AUC:", round(roc_auc_score(y_test_KNN, prediction_test_KNN), 3))
The configuration test = 0.2 and train = 0.8 gives the highest AUC on the test dataset; moreover, the AUC results on the train and test datasets are similar in this configuration.
#X and y from dataset
X_KNN = data_modelling.loc[:, data.columns != "wine_type"]
y_KNN = data_modelling.loc[:, data.columns == "wine_type"]
#Split dataset to train and test
X_train_KNN, X_test_KNN, y_train_KNN, y_test_KNN = train_test_split(X_KNN,
y_KNN,
test_size = 0.2,
random_state = 222)
# #Scaling data
# scaler = StandardScaler()
# X_train_KNN = scaler.fit_transform(X_train_KNN)
# X_test_KNN = scaler.transform(X_test_KNN)
#Build and train KNN model
KNN = neighbors.KNeighborsClassifier(n_neighbors = 15, n_jobs=-1, p = 1)
KNN = KNN.fit(X = X_train_KNN, y = y_train_KNN)
#Predictions on train and test datasets
TRAIN_pred_KNN = KNN.predict(X_train_KNN)
TEST_pred_KNN = KNN.predict(X_test_KNN)
#Call the previously built function to calculate the confusion matrix
conf_matrix(model_name = "KNN",
y_test = y_test_KNN,
pred_test = TEST_pred_KNN)
#Call the previously built function to calculate the classification report
class_report(y_test = y_test_KNN,
y_train = y_train_KNN,
pred_test = TEST_pred_KNN,
pred_train = TRAIN_pred_KNN,
model_name = "KNN")
#Call the previously built function to compare statistics on the train and test datasets
stat_comparison(y_test = y_test_KNN,
y_train = y_train_KNN,
X_test = X_test_KNN,
X_train = X_train_KNN,
pred_test = TEST_pred_KNN,
pred_train = TRAIN_pred_KNN,
model=KNN)
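One of the statistics compared above is the Gini coefficient, which for a binary classifier is just a rescaling of AUC: Gini = 2·AUC − 1. A quick check on toy labels and scores (not the wine data):

```python
from sklearn.metrics import roc_auc_score

# Toy example: 4 observations, 2 of each class (not the wine data)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1  # Gini coefficient derived from AUC
print(auc, gini)    # 0.75 and 0.5
```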
#Call the previously built function to plot ROC curves
rcParams["figure.figsize"] = 18,5
plot_roc_cur(model = KNN, X = X_test_KNN, y = y_test_KNN, df="test", color="black", model_name = "KNN")
plot_roc_cur(model = KNN, X = X_train_KNN, y = y_train_KNN, df="train", color="green", model_name = "KNN")
#Probabilities
y_prob = KNN.predict_proba(X_test_KNN)
The graph shows what percentage of red and white wines the model catches in a given percentage of the scoring list. For example, in the top 10% of the list of both red and white wines with the highest score, the model catches 30% of the correct wine classifications. The model catches red and white wines with similar effectiveness.
#PROFIT curve
skplt.metrics.plot_cumulative_gain(y_test_KNN,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='PROFIT curve - Cumulative Gains Curve for KNN Model')
plt.show()
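The cumulative gains value plotted above can also be computed by hand: sort observations by score, take the top fraction, and measure which share of all positives falls inside it. A small sketch with made-up labels and scores (the helper name `cumulative_gain` is ours, not scikit-plot's):

```python
import numpy as np

def cumulative_gain(y_true, y_score, top_fraction):
    """Share of all positives captured in the top `top_fraction` of the score-sorted list."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)[::-1]        # highest scores first
    n_top = int(np.ceil(top_fraction * len(y_true)))
    captured = y_true[order][:n_top].sum()   # positives inside the top slice
    return captured / y_true.sum()

# Made-up example: 10 observations, 4 positives, positives mostly scored high
y_true  = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(cumulative_gain(y_true, y_score, 0.5))  # 0.75: top 50% captures 3 of 4 positives
```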
The graph illustrates how many times more effective the model is at capturing wine type than classifying wines without the model (random selection). For example, in the top 10% of the list of red and white wines with the highest score, the model correctly catches twice as many wines as random selection would.
#LIFT curve
skplt.metrics.plot_lift_curve(y_test_KNN,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='LIFT curve for KNN Model')
plt.show()
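Lift at a given list depth is simply the cumulative gain at that depth divided by the depth itself, i.e. how many times better the model does than random selection. A sketch of the same hand computation on toy data (the helper name `lift_at` is ours):

```python
import numpy as np

def lift_at(y_true, y_score, top_fraction):
    """Gain in the top fraction of the score-sorted list, divided by the fraction (lift over random)."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)[::-1]      # highest scores first
    n_top = int(np.ceil(top_fraction * len(y_true)))
    gain = y_true[order][:n_top].sum() / y_true.sum()
    return gain / top_fraction

# Toy data: 10 observations, 5 positives concentrated at the top of the scoring list
y_true  = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_score = [0.95, 0.9, 0.85, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
print(lift_at(y_true, y_score, 0.2))  # 2.0: top 20% holds twice the random share of positives
```

At depth 1.0 the lift is always 1, which is why lift curves converge to 1 at the right edge.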
#Results on full dataset (probabilities of 0 and 1)
X_all = data_modelling.loc[:, data.columns != "wine_type"].values
y_all_prob = KNN.predict_proba(X_all)
#Main DF and DF with probabilities
df_X_all = data_modelling
df_y_all_prob = pd.DataFrame(y_all_prob)
#Concatenation DF and probabilities
df_X_all_scored = pd.concat([df_X_all, df_y_all_prob * 100], axis = 1)
df_X_all_scored.rename(columns={0 : "Prob_0_white_wine", 1 : "Prob_1_red_wine"}, inplace=True)
df_X_all_scored
#Save columns with wine type and probabilities to Excel file
df_X_all_scored[["wine_type", "Prob_0_white_wine", "Prob_1_red_wine"]].to_excel("probabilities_results/prob_KNN.xlsx")
#Save all dataset with probabilities to Excel file
df_X_all_scored.to_excel("probabilities_results/all_df_KNN.xlsx")
#X and y from dataset
X_SVM = data_modelling.loc[:, data.columns != "wine_type"].columns.tolist()
y_SVM = data_modelling.loc[:, data.columns == "wine_type"].columns.tolist()
#Wrapper for cross-validation with 5 folds - *args and **kwargs at the end pass extra
#estimator parameters
def CVTestSVM(nFolds = 5, randomState=2020, debug=False, *args, **kwargs):
    kf = KFold(n_splits=nFolds, shuffle=True, random_state=randomState)
    #Lists with results
    testResults = []
    trainResults = []
    predictions = []
    indexes = []
    for train_SVM, test_SVM in kf.split(data_modelling.index.values):
        #Preparing the estimator
        SVM = SVC(probability=True, max_iter=-1, random_state=2020, tol=0.001, cache_size=500, *args, **kwargs)
        #Display the estimator configuration
        if debug:
            print(SVM)
        X = data_modelling.iloc[train_SVM]
        #Training the model
        SVM.fit(X[X_SVM], X[y_SVM])
        #Predictions for train and test datasets
        predictions_train = SVM.predict_proba(data_modelling.iloc[train_SVM][X_SVM])[:,1]
        predictions_test = SVM.predict_proba(data_modelling.iloc[test_SVM][X_SVM])[:,1]
        #Keep the prediction information for this fold
        predictions.append(predictions_test.tolist().copy())
        #Together with the indexes in the original data frame
        indexes.append(data_modelling.iloc[test_SVM].index.tolist().copy())
        #Calculation of statistics on each fold
        trainScore = roc_auc_score((data_modelling[y_SVM].iloc[train_SVM]==1).astype(int), predictions_train)
        testScore = roc_auc_score((data_modelling[y_SVM].iloc[test_SVM]==1).astype(int), predictions_test)
        #Saving results to lists
        trainResults.append(trainScore)
        testResults.append(testScore)
        #Optionally display the training results for each fold
        if debug:
            print("Train AUC:", trainScore,
                  "Valid AUC:", testScore)
    return trainResults, testResults, predictions, indexes
#Training of linear model (with default hyperparameter C)
trainResults, testResults, predictions, indexes = CVTestSVM(debug=False, kernel="linear")
#Display mean results for training set and test set from 5 folds
print(np.mean(trainResults),"***", np.mean(testResults), testResults)
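The custom wrapper can be cross-checked against scikit-learn's built-in `cross_val_score`, which performs the same K-fold AUC evaluation in one call. A sketch on synthetic data (the real notebook would pass the `data_modelling` features and target instead):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the wine features / target (illustration only)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Same folding scheme as the wrapper: 5 shuffled folds, fixed random state
kf = KFold(n_splits=5, shuffle=True, random_state=2020)
scores = cross_val_score(SVC(kernel="linear", probability=True),
                         X, y, cv=kf, scoring="roc_auc")
print(len(scores), round(np.mean(scores), 3))  # one AUC per fold and their mean
```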
#SVM with polynomial kernel, degree two
trainResults, testResults, predictions, indexes = CVTestSVM(debug=True, degree=2, kernel="poly")
#Display mean results for training set and test set from 5 folds
print(np.mean(trainResults),"***", np.mean(testResults))
#SVM with "rbf" kernel
trainResults, testResults, predictions, indexes = CVTestSVM(debug=True, kernel="rbf",)
#Display mean results for training set and test set from 5 folds
print(np.mean(trainResults),"***", np.mean(testResults))
#Loop to find the best configuration of hyper parameters in cross validation with 5 folds
#Empty lists for hyper parameters from loop
c_list = list()
kernel_list = list()
gamma_list = list()
decision_function_shape_list = list()
mean_TRAIN_list = list()
mean_TEST_list = list()
#Loop over hyperparameters - because tuning many hyperparameters in this loop takes
#an extremely long time, the range of hyperparameters was reduced
print("C || kernel || gamma || decision_function_shape || mean_test_result || mean_train_result || train_test_difference")
print("================================================================================================================")
for c in [0.01, 0.1, 0.25, 0.5, 1, 2, 5, 10, 25, 50, 100]:
    for kernel in ['linear', 'poly', 'rbf', 'sigmoid']: #possible also: 'precomputed'
        for gamma in ['scale']: #possible also: 'auto'
            for dec_fk_shp in ['ovr']: #possible also: 'ovo'
                trainResults, testResults, predictions, indices = CVTestSVM(debug=False,
                                                                            kernel=kernel,
                                                                            C=c,
                                                                            gamma=gamma,
                                                                            decision_function_shape=dec_fk_shp)
                #Append values from the loop to lists
                c_list.append(c)
                kernel_list.append(kernel)
                gamma_list.append(gamma)
                decision_function_shape_list.append(dec_fk_shp)
                mean_TRAIN_list.append(np.mean(trainResults))
                mean_TEST_list.append(np.mean(testResults))
                #Display mean results for the training and test sets from 5 folds for each hyperparameter configuration
                print(c, "||",
                      kernel, "||",
                      gamma, "||",
                      dec_fk_shp, "||",
                      np.mean(testResults), "||",
                      np.mean(trainResults), "||",
                      (np.mean(trainResults) - np.mean(testResults)))
#Save results of hyperparameter tuning in a DataFrame
df = pd.DataFrame()
df["C"] = c_list
df["kernel"] = kernel_list
df["gamma"] = gamma_list
df["decision_function_shape"] = decision_function_shape_list
df["mean_TEST"] = mean_TEST_list
df["mean_TRAIN"] = mean_TRAIN_list
df["TRAIN_TEST_difference"] = df["mean_TRAIN"] - df["mean_TEST"]
As we can see, the combination of hyperparameters C=100, kernel="linear", gamma="scale" and decision_function_shape="ovr" gives the best results on the test dataset. Moreover, the results on the test and train datasets are similar, which is also good. Nevertheless, the whole dataset is quite small, so the model will unfortunately probably be overfitted.
#The best combination of hyperparameters in the SVM model based on mean results on the TEST dataset
df.sort_values(by="mean_TEST", ascending=False)
#X and y from dataset
X_SVM = data_modelling.loc[:, data.columns != "wine_type"]
y_SVM = data_modelling.loc[:, data.columns == "wine_type"]
#Loop to find the optimal train / test split
for k in range(1, 10):
    X_train_SVM, X_test_SVM, y_train_SVM, y_test_SVM = train_test_split(X_SVM,
                                                                        y_SVM,
                                                                        test_size = 0.1*k,
                                                                        random_state = 333)
    # #Scaling data
    # scaler = StandardScaler()
    # X_train_SVM = scaler.fit_transform(X_train_SVM)
    # X_test_SVM = scaler.transform(X_test_SVM)
    #SVM model with hyperparameters after tuning
    SVM = SVC(C=100,
              kernel = "linear",
              gamma = "scale",
              decision_function_shape = "ovr")
    SVM.fit(X = X_train_SVM, y = y_train_SVM)
    #Prediction on train dataset
    prediction_train_SVM = SVM.predict(X_train_SVM)
    #Prediction on test dataset
    prediction_test_SVM = SVM.predict(X_test_SVM)
    #Printing results
    print(f"test: {k/10}, Train AUC:", round(roc_auc_score(y_train_SVM, prediction_train_SVM), 3),
          "Test AUC:", round(roc_auc_score(y_test_SVM, prediction_test_SVM), 3))
The configuration test = 0.2 / train = 0.8 gives the highest AUC on the test dataset; moreover, the train and test AUC results in this configuration are similar.
#X and y from dataset
X_SVM = data_modelling.loc[:, data.columns != "wine_type"]
y_SVM = data_modelling.loc[:, data.columns == "wine_type"]
#Split dataset to train and test
X_train_SVM, X_test_SVM, y_train_SVM, y_test_SVM = train_test_split(X_SVM,
y_SVM,
train_size = 0.8,
test_size = 0.2,
random_state = 444)
# #Scaling data
# scaler = StandardScaler()
# X_train_SVM = scaler.fit_transform(X_train_SVM)
# X_test_SVM = scaler.transform(X_test_SVM)
#Build and train the SVM model with tuned hyperparameters and the best train / test split
SVM = SVC(C=100,
kernel = "linear",
gamma="scale",
decision_function_shape="ovr",
probability=True)
SVM = SVM.fit(X = X_train_SVM, y = y_train_SVM)
#Predictions on train and test datasets
TRAIN_pred_SVM = SVM.predict(X_train_SVM)
TEST_pred_SVM = SVM.predict(X_test_SVM)
#Call the previously built function to calculate the confusion matrix
conf_matrix(model_name = "SVM",
y_test = y_test_SVM,
pred_test = TEST_pred_SVM)
#Call the previously built function to calculate the classification report
class_report(y_test = y_test_SVM,
y_train = y_train_SVM,
pred_test = TEST_pred_SVM,
pred_train = TRAIN_pred_SVM,
model_name = "SVM")
#Call the previously built function to compare statistics on the train and test datasets
stat_comparison(y_test = y_test_SVM,
y_train = y_train_SVM,
X_test = X_test_SVM,
X_train = X_train_SVM,
pred_test = TEST_pred_SVM,
pred_train = TRAIN_pred_SVM,
model = SVM)
#Call the previously built function to plot ROC curves
rcParams["figure.figsize"] = 18,5
plot_roc_cur(model = SVM, X = X_test_SVM, y = y_test_SVM, df="test", color="orange", model_name = "SVM")
plot_roc_cur(model = SVM, X = X_train_SVM, y = y_train_SVM, df="train", color="green", model_name = "SVM")
#Probabilities
y_prob = SVM.predict_proba(X_test_SVM)
The graph shows what percentage of red and white wines the model catches in a given percentage of the scoring list. For example, in the top 10% of the list of both red and white wines with the highest score, the model catches 30% of the correct wine classifications. The model catches red wines slightly better.
#PROFIT curve
skplt.metrics.plot_cumulative_gain(y_test_SVM,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='PROFIT curve - Cumulative Gains Curve for SVM Model')
plt.show()
The graph illustrates how many times more effective the model is at capturing wine type than classifying wines without the model. For example, in the top 10% of the list of red and white wines with the highest score, the model correctly catches twice as many red wines; it does slightly worse on white wines.
#LIFT curve
skplt.metrics.plot_lift_curve(y_test_SVM,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='LIFT curve for SVM Model')
plt.show()
#Results on full dataset (probabilities of 0 and 1)
X_all = data_modelling.loc[:, data.columns != "wine_type"].values
y_all_prob = SVM.predict_proba(X_all)
#Main DF and DF with probabilities
df_X_all = data_modelling
df_y_all_prob = pd.DataFrame(y_all_prob)
#Concatenation DF and probabilities
df_X_all_scored = pd.concat([df_X_all, df_y_all_prob * 100], axis = 1)
df_X_all_scored.rename(columns={0 : "Prob_0_white_wine", 1 : "Prob_1_red_wine"}, inplace=True)
df_X_all_scored
#Save columns with wine type and probabilities to Excel file
df_X_all_scored[["wine_type", "Prob_0_white_wine", "Prob_1_red_wine"]].to_excel("probabilities_results/prob_SVM.xlsx")
#Save all dataset with probabilities to Excel file
df_X_all_scored.to_excel("probabilities_results/all_df_SVM.xlsx")
#X and y from dataset
X_NB = data_modelling.loc[:, data.columns != "wine_type"]
y_NB = data_modelling.loc[:, data.columns == "wine_type"]
#Loop to find the optimal train / test split
for k in range(1, 10):
    X_train_NB, X_test_NB, y_train_NB, y_test_NB = train_test_split(X_NB,
                                                                    y_NB,
                                                                    test_size = 0.1*k,
                                                                    random_state = 555)
    #NB model
    NB = GaussianNB()
    NB.fit(X = X_train_NB, y = y_train_NB)
    #Prediction on train dataset
    prediction_train_NB = NB.predict(X_train_NB)
    #Prediction on test dataset
    prediction_test_NB = NB.predict(X_test_NB)
    #Printing results
    print(f"test: {k/10}, Train AUC:", round(roc_auc_score(y_train_NB, prediction_train_NB), 3),
          "Test AUC:", round(roc_auc_score(y_test_NB, prediction_test_NB), 3))
The configuration test = 0.5 / train = 0.5 gives the highest AUC on the test dataset; moreover, the train and test AUC results in this configuration are similar.
#X and y from dataset
X_NB = data_modelling.loc[:, data.columns != "wine_type"]
y_NB = data_modelling.loc[:, data.columns == "wine_type"]
#Split dataset into train and test with the chosen 0.5 / 0.5 proportions
X_train_NB, X_test_NB, y_train_NB, y_test_NB = train_test_split(X_NB,
                                                                y_NB,
                                                                train_size = 0.5,
                                                                test_size = 0.5,
                                                                random_state = 666)
#NB model
NB = GaussianNB()
NB.fit(X = X_train_NB, y = y_train_NB)
#Predictions on train and test datasets
TRAIN_pred_NB = NB.predict(X_train_NB)
TEST_pred_NB = NB.predict(X_test_NB)
#Call the previously built function to calculate the confusion matrix
conf_matrix(model_name = "NB",
y_test = y_test_NB,
pred_test = TEST_pred_NB)
#Call the previously built function to calculate the classification report
class_report(y_test = y_test_NB,
y_train = y_train_NB,
pred_test = TEST_pred_NB,
pred_train = TRAIN_pred_NB,
model_name = "NB")
#Call the previously built function to compare statistics on the train and test datasets
stat_comparison(y_test = y_test_NB,
y_train = y_train_NB,
X_test = X_test_NB,
X_train = X_train_NB,
pred_test = TEST_pred_NB,
pred_train = TRAIN_pred_NB,
model=NB)
#Call the previously built function to plot ROC curves
rcParams["figure.figsize"] = 18,5
plot_roc_cur(model = NB, X = X_test_NB, y = y_test_NB, df="test", color="red", model_name = "Naive Bayes")
plot_roc_cur(model = NB, X = X_train_NB, y = y_train_NB, df="train", color="brown", model_name = "Naive Bayes")
#Probabilities
y_prob = NB.predict_proba(X_test_NB)
The graph shows what percentage of red and white wines the model catches in a given percentage of the scoring list. For example, in the top 10% of the list of both red and white wines with the highest score, the model catches 30% of the correct wine classifications. The model catches red wines slightly better.
#PROFIT curve
skplt.metrics.plot_cumulative_gain(y_test_NB,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='PROFIT curve - Cumulative Gains Curve for Naive Bayes Model')
plt.show()
The graph illustrates how many times more effective the model is at capturing wine type than classifying wines without the model. For example, in the top 10% of the list of red and white wines with the highest score, the model correctly catches twice as many red wines; it does slightly worse on white wines.
#LIFT curve
skplt.metrics.plot_lift_curve(y_test_NB,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='LIFT curve for Naive Bayes Model')
plt.show()
#Results on full dataset (probabilities of 0 and 1)
X_all = data_modelling.loc[:, data.columns != "wine_type"].values
y_all_prob = NB.predict_proba(X_all)
#Main DF and DF with probabilities
df_X_all = data_modelling
df_y_all_prob = pd.DataFrame(y_all_prob)
#Concatenation DF and probabilities
df_X_all_scored = pd.concat([df_X_all, df_y_all_prob * 100], axis = 1)
df_X_all_scored.rename(columns={0 : "Prob_0_white_wine", 1 : "Prob_1_red_wine"}, inplace=True)
df_X_all_scored
#Save columns with wine type and probabilities to Excel file
df_X_all_scored[["wine_type", "Prob_0_white_wine", "Prob_1_red_wine"]].to_excel("probabilities_results/prob_Naive_Bayes.xlsx")
#Save all dataset with probabilities to Excel file
df_X_all_scored.to_excel("probabilities_results/all_df_Naive_Bayes.xlsx")
#X and y from dataset
X_DT = data_modelling.loc[:, data.columns != "wine_type"].columns.tolist()
y_DT = data_modelling.loc[:, data.columns == "wine_type"].columns.tolist()
#Wrapper for cross-validation with 5 folds - *args and **kwargs at the end pass extra
#estimator parameters (note: this reuses the CVTestRFClass name, here with a Decision Tree estimator)
def CVTestRFClass(nFolds = 5, randomState=2020, debug=False, *args, **kwargs):
    kf = KFold(n_splits=nFolds, shuffle=True, random_state=randomState)
    #Lists for results
    testResults = []
    trainResults = []
    predictions = []
    indices = []
    #Loop validating the model on successive folds
    for train_DT, test_DT in kf.split(data_modelling.index.values):
        #Preparing the estimator
        DT = DecisionTreeClassifier(*args, **kwargs, random_state=randomState)
        if debug:
            print(DT)
        #Training the model
        DT.fit(data_modelling.iloc[train_DT][X_DT], data_modelling.iloc[train_DT][y_DT])
        #Predictions for train and test datasets
        predictions_train = DT.predict_proba(data_modelling.iloc[train_DT][X_DT])[:,1]
        predictions_test = DT.predict_proba(data_modelling.iloc[test_DT][X_DT])[:,1]
        #Keep the prediction information for this fold
        predictions.append(predictions_test.tolist().copy())
        #Together with the indexes in the original data frame
        indices.append(data_modelling.iloc[test_DT].index.tolist().copy())
        #Calculation of ROC-AUC
        trainScore = roc_auc_score((data_modelling[y_DT].iloc[train_DT]==1).astype(int), predictions_train)
        testScore = roc_auc_score((data_modelling[y_DT].iloc[test_DT]==1).astype(int), predictions_test)
        #Saving results to lists
        trainResults.append(trainScore)
        testResults.append(testScore)
        #Optionally display the training results for each fold
        if debug:
            print("Train AUC:", trainScore,
                  "Valid AUC:", testScore)
    return trainResults, testResults, predictions, indices
#Training of the Decision Tree model with default hyperparameters
trainResults, testResults, predictions, indexes = CVTestRFClass()
#Display mean results for training set and test set from 5 folds
print(np.mean(trainResults),"***", np.mean(testResults))
#Loop to find the best configuration of hyper parameters in cross validation with 5 folds
#Empty lists for hyper parameters from loop
criterion_list = list()
splitter_list = list()
max_depth_list = list()
mss_list = list()
msl_list = list()
max_features_list = list()
mean_TRAIN_list = list()
mean_TEST_list = list()
#Loop over hyperparameters - because tuning many hyperparameters in this loop takes
#an extremely long time, the range of hyperparameters was reduced
print("criterion || splitter || max_depth || min_samples_split || min_samples_leaf || max_features || mean_test_result || mean_train_result || train_test_difference")
print("============================================================================================================================================================")
for criterion in ["gini", "entropy"]:
    for splitter in ["best", "random"]:
        for max_depth in [3, 4, 5, 10, 12, 15, 20]:
            for min_samples_split in [2, 3, 4]:
                for min_samples_leaf in [1, 2, 3, 4]:
                    for max_features in ["auto", "sqrt", "log2"]:
                        trainResults, testResults, predictions, indices = CVTestRFClass(debug=False,
                                                                                        criterion=criterion,
                                                                                        splitter=splitter,
                                                                                        max_depth=max_depth,
                                                                                        min_samples_split=min_samples_split,
                                                                                        min_samples_leaf=min_samples_leaf,
                                                                                        max_features=max_features)
                        #Append values from the loop to lists
                        criterion_list.append(criterion)
                        splitter_list.append(splitter)
                        max_depth_list.append(max_depth)
                        mss_list.append(min_samples_split)
                        msl_list.append(min_samples_leaf)
                        max_features_list.append(max_features)
                        mean_TRAIN_list.append(np.mean(trainResults))
                        mean_TEST_list.append(np.mean(testResults))
                        #Display mean results for the training and test sets from 5 folds for each hyperparameter configuration
                        print(criterion, "||",
                              splitter, "||",
                              max_depth, "||",
                              min_samples_split, "||",
                              min_samples_leaf, "||",
                              max_features, "||",
                              np.mean(testResults), "||",
                              np.mean(trainResults), "||",
                              (np.mean(trainResults) - np.mean(testResults)))
#Save results of hyperparameter tuning in a DataFrame
df = pd.DataFrame()
df["criterion"] = criterion_list
df["splitter"] = splitter_list
df["max_depth"] = max_depth_list
df["min_samples_split"] = mss_list
df["min_samples_leaf"] = msl_list
df["max_features"] = max_features_list
df["mean_TEST"] = mean_TEST_list
df["mean_TRAIN"] = mean_TRAIN_list
df["TRAIN_TEST_difference"] = df["mean_TRAIN"] - df["mean_TEST"]
#The best combination of hyperparameters in the Decision Tree model based on mean results on the TEST dataset
df.sort_values(by="mean_TEST", ascending=False)
#X and y from dataset
X_DT = data_modelling.loc[:, data.columns != "wine_type"]
y_DT = data_modelling.loc[:, data.columns == "wine_type"]
#Loop to find the optimal train / test split
for k in range(1, 10):
    X_train_DT, X_test_DT, y_train_DT, y_test_DT = train_test_split(X_DT,
                                                                    y_DT,
                                                                    test_size = 0.1*k,
                                                                    random_state = 777)
    #Decision Tree model with hyperparameters after tuning
    DT = DecisionTreeClassifier(criterion = "gini",
                                splitter = "random",
                                max_depth = 12,
                                min_samples_split = 2,
                                min_samples_leaf = 3,
                                max_features = "sqrt")
    DT.fit(X = X_train_DT, y = y_train_DT)
    #Prediction on train dataset
    prediction_train_DT = DT.predict(X_train_DT)
    #Prediction on test dataset
    prediction_test_DT = DT.predict(X_test_DT)
    #Printing results
    print(f"test: {k/10}, Train AUC:", round(roc_auc_score(y_train_DT, prediction_train_DT), 3),
          "Test AUC:", round(roc_auc_score(y_test_DT, prediction_test_DT), 3))
The configuration test = 0.4 / train = 0.6 gives the highest AUC on the test dataset; moreover, the train and test AUC results in this configuration are similar.
#X and y from dataset
X_DT = data_modelling.loc[:, data.columns != "wine_type"]
y_DT = data_modelling.loc[:, data.columns == "wine_type"]
#Split dataset to train and test
X_train_DT, X_test_DT, y_train_DT, y_test_DT = train_test_split(X_DT,
y_DT,
train_size = 0.6,
test_size = 0.4,
random_state = 888)
#Build and train the Decision Tree model with tuned hyperparameters and the best train / test split
DT = DecisionTreeClassifier(criterion = "gini",
splitter = "random",
max_depth = 12,
min_samples_split = 2,
min_samples_leaf = 3,
max_features = "sqrt")
DT = DT.fit(X = X_train_DT, y = y_train_DT)
#Predictions on train and test datasets
TRAIN_pred_DT = DT.predict(X_train_DT)
TEST_pred_DT = DT.predict(X_test_DT)
#Call the previously built function to calculate the confusion matrix
conf_matrix(model_name = "Decision Tree",
y_test = y_test_DT,
pred_test = TEST_pred_DT)
#Call the previously built function to calculate the classification report
class_report(y_test = y_test_DT,
y_train = y_train_DT,
pred_test = TEST_pred_DT,
pred_train = TRAIN_pred_DT,
model_name = "Decision Tree")
#Call the previously built function to compare statistics on the train and test datasets
stat_comparison(y_test = y_test_DT,
y_train = y_train_DT,
X_test = X_test_DT,
X_train = X_train_DT,
pred_test = TEST_pred_DT,
pred_train = TRAIN_pred_DT,
model = DT)
#Call the previously built function to plot ROC curves
rcParams["figure.figsize"] = 18,5
plot_roc_cur(model = DT, X = X_test_DT, y = y_test_DT, df="test", color="blue", model_name = "Decision Tree")
plot_roc_cur(model = DT, X = X_train_DT, y = y_train_DT, df="train", color="brown", model_name = "Decision Tree")
#Probabilities
y_prob = DT.predict_proba(X_test_DT)
The graph shows what percentage of red and white wines the model catches in a given percentage of the scoring list. For example, in the top 10% of the list of both red and white wines with the highest score, the model catches 30% of the correct wine classifications. The model catches red wines slightly better.
#PROFIT curve
skplt.metrics.plot_cumulative_gain(y_test_DT,
                                   y_prob,
                                   figsize=(15,5),
                                   title_fontsize=16,
                                   text_fontsize=10,
                                   title='PROFIT curve - Cumulative Gains Curve for Decision Tree Model')
plt.show()
The graph illustrates how many times more effective the model is at capturing wine type than classifying wines without the model. For example, in the top 10% of the list of red and white wines with the highest score, the model correctly catches twice as many red wines; it does slightly worse on white wines.
#LIFT curve
skplt.metrics.plot_lift_curve(y_test_DT,
                              y_prob,
                              figsize=(15,5),
                              title_fontsize=16,
                              text_fontsize=10,
                              title='LIFT curve for Decision Tree Model')
plt.show()
#Results on full dataset (probabilities of 0 and 1)
X_all = data_modelling.loc[:, data.columns != "wine_type"].values
y_all_prob = DT.predict_proba(X_all)
#Main DF and DF with probabilities
df_X_all = data_modelling
df_y_all_prob = pd.DataFrame(y_all_prob)
#Concatenation DF and probabilities
df_X_all_scored = pd.concat([df_X_all, df_y_all_prob * 100], axis = 1)
df_X_all_scored.rename(columns={0 : "Prob_0_white_wine", 1 : "Prob_1_red_wine"}, inplace=True)
df_X_all_scored
#Save columns with wine type and probabilities to Excel file
df_X_all_scored[["wine_type", "Prob_0_white_wine", "Prob_1_red_wine"]].to_excel("probabilities_results/prob_Decision_Tree.xlsx")
#Save all dataset with probabilities to Excel file
df_X_all_scored.to_excel("probabilities_results/all_df_Decision_Tree.xlsx")
#X and y from dataset
X_RF = data_modelling.loc[:, data.columns != "wine_type"].columns.tolist()
y_RF = data_modelling.loc[:, data.columns == "wine_type"].columns.tolist()
#Wrapper for cross-validation with 5 folds - *args and **kwargs at the end pass extra
#estimator parameters
def CVTestRFClass(nFolds = 5, randomState=2020, debug=False, *args, **kwargs):
    kf = KFold(n_splits=nFolds, shuffle=True, random_state=randomState)
    #Lists for results
    testResults = []
    trainResults = []
    predictions = []
    indices = []
    #Loop validating the model on successive folds
    for train_RF, test_RF in kf.split(data_modelling.index.values):
        #Preparing the estimator
        RF = RandomForestClassifier(*args, **kwargs, random_state=randomState, n_jobs=-1)
        if debug:
            print(RF)
        #Model training
        RF.fit(data_modelling.iloc[train_RF][X_RF], data_modelling.iloc[train_RF][y_RF])
        #Predictions for train and test datasets
        predictions_train = RF.predict_proba(data_modelling.iloc[train_RF][X_RF])[:,1]
        predictions_test = RF.predict_proba(data_modelling.iloc[test_RF][X_RF])[:,1]
        #Keep the prediction information for this fold
        predictions.append(predictions_test.tolist().copy())
        #Together with the indexes in the original data frame
        indices.append(data_modelling.iloc[test_RF].index.tolist().copy())
        #Calculation of ROC-AUC
        trainScore = roc_auc_score((data_modelling[y_RF].iloc[train_RF]==1).astype(int), predictions_train)
        testScore = roc_auc_score((data_modelling[y_RF].iloc[test_RF]==1).astype(int), predictions_test)
        #Saving results to lists
        trainResults.append(trainScore)
        testResults.append(testScore)
        #Optionally display the training results for each fold
        if debug:
            print("Train AUC:", trainScore,
                  "Valid AUC:", testScore)
    return trainResults, testResults, predictions, indices
trainResults, testResults, predictions, indices = CVTestRFClass(debug=True)
print(np.mean(testResults),"**", np.mean(trainResults))
#Loop to find the best configuration of hyper parameters in cross validation with 5 folds
#Empty lists for hyper parameters from loop
criterion_list = list()
max_depth_list = list()
mss_list = list()
msl_list = list()
max_features_list = list()
bootstrap_list = list()
estimators_list = list()
mean_TRAIN_list = list()
mean_TEST_list = list()
#Loop over hyperparameters - because tuning many hyperparameters in this loop takes
#an extremely long time, the range of hyperparameters was reduced
print("criterion || max_depth || min_samples_split || min_samples_leaf || max_features || n_estimators || bootstrap || mean_test || mean_train || train_test_difference")
print("================================================================================================================================================================")
for criterion in ["gini", "entropy"]:
    for max_depth in [3, 5, 10, 15, 20]:
        for min_samples_split in [2, 3, 4]:
            for min_samples_leaf in [1, 2, 3]:
                for max_features in ["auto", "sqrt", "log2"]:
                    for n_estimators in [10, 50, 200, 500, 1000]:
                        for bootstrap in [True]:
                            trainResults, testResults, predictions, indices = CVTestRFClass(debug=False,
                                                                                           criterion=criterion,
                                                                                           max_depth=max_depth,
                                                                                           min_samples_split=min_samples_split,
                                                                                           min_samples_leaf=min_samples_leaf,
                                                                                           max_features=max_features,
                                                                                           n_estimators=n_estimators,
                                                                                           bootstrap=bootstrap)
                            #Append values from the loop to lists
                            criterion_list.append(criterion)
                            max_depth_list.append(max_depth)
                            mss_list.append(min_samples_split)
                            msl_list.append(min_samples_leaf)
                            max_features_list.append(max_features)
                            estimators_list.append(n_estimators)
                            bootstrap_list.append(bootstrap)
                            mean_TRAIN_list.append(np.mean(trainResults))
                            mean_TEST_list.append(np.mean(testResults))
                            #Display mean results for the training and test sets from 5 folds for each hyperparameter configuration
                            print(criterion, "||",
                                  max_depth, "||",
                                  min_samples_split, "||",
                                  min_samples_leaf, "||",
                                  max_features, "||",
                                  n_estimators, "||",
                                  bootstrap, "||",
                                  np.mean(testResults), "||",
                                  np.mean(trainResults), "||",
                                  (np.mean(trainResults) - np.mean(testResults)))
#Save results of hyperparameter tuning in a DataFrame
df = pd.DataFrame()
df["criterion"] = criterion_list
df["max_depth"] = max_depth_list
df["min_samples_split"] = mss_list
df["min_samples_leaf"] = msl_list
df["max_features"] = max_features_list
df["n_estimators"] = estimators_list
df["bootstrap"] = bootstrap_list
df["mean_TEST"] = mean_TEST_list
df["mean_TRAIN"] = mean_TRAIN_list
df["TRAIN_TEST_difference"] = df["mean_TRAIN"] - df["mean_TEST"]
#The best combination of hyperparameters in the Random Forest model based on mean results on the TEST dataset
df.sort_values(by="mean_TEST", ascending=False)
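With seven nested loops the grid grows multiplicatively, which is why the ranges above had to be cut. `RandomizedSearchCV` samples a fixed number of combinations instead, a common way to bound the run time. A hedged sketch on synthetic data (a subset of the hyperparameters tuned above; the dataset is made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the wine data (illustration only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A subset of the hyperparameters tuned in the loop above
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, 15, 20],
    "min_samples_split": [2, 3, 4],
    "n_estimators": [10, 50, 200],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=2020),
                            param_distributions,
                            n_iter=10,          # only 10 sampled combinations, not the full grid
                            scoring="roc_auc",  # same metric as the manual loop
                            cv=5,
                            random_state=2020)
search.fit(X, y)
print(search.best_params_)
```

The trade-off is that the sampled search may miss the global best combination, but with AUC plateaus as flat as the ones seen above, it usually lands close.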
#X and y from dataset
X_RF = data_modelling.loc[:, data.columns != "wine_type"]
y_RF = data_modelling.loc[:, data.columns == "wine_type"]
#Loop to find the optimal train / test split
for k in range(1, 10):
    X_train_RF, X_test_RF, y_train_RF, y_test_RF = train_test_split(X_RF,
                                                                    y_RF,
                                                                    test_size = 0.1*k,
                                                                    random_state = 888)
    #Random Forest model with hyperparameters after tuning
    RF = RandomForestClassifier(criterion = "entropy",
                                max_depth = 20,
                                min_samples_split = 4,
                                min_samples_leaf = 4,
                                max_features = "log2",
                                n_estimators = 50,
                                bootstrap = True)
    RF.fit(X = X_train_RF, y = y_train_RF)
    #Prediction on train dataset
    prediction_train_RF = RF.predict(X_train_RF)
    #Prediction on test dataset
    prediction_test_RF = RF.predict(X_test_RF)
    #Printing results
    print(f"test: {k/10}, Train AUC:", round(roc_auc_score(y_train_RF, prediction_train_RF), 3),
          "Test AUC:", round(roc_auc_score(y_test_RF, prediction_test_RF), 3))
The configuration test = 0.2 / train = 0.8 gives the highest AUC on the test dataset; moreover, the train and test AUC results in this configuration are similar.
#X and y from dataset
X_RF = data_modelling.loc[:, data_modelling.columns != "wine_type"]
y_RF = data_modelling.loc[:, data_modelling.columns == "wine_type"]
#Split dataset to train and test
X_train_RF, X_test_RF, y_train_RF, y_test_RF = train_test_split(X_RF,
y_RF,
train_size = 0.8,
test_size = 0.2,
random_state = 999)
#Build and train the Random Forest model with the tuned hyperparameters and the best train / test split
RF = RandomForestClassifier(criterion = "entropy",
max_depth = 20,
min_samples_split = 4,
min_samples_leaf = 4,
max_features = "log2",
n_estimators = 50,
bootstrap = True)
RF = RF.fit(X = X_train_RF, y = y_train_RF)
#Predictions on train and test datasets
TRAIN_pred_RF = RF.predict(X_train_RF)
TEST_pred_RF = RF.predict(X_test_RF)
#Call the previously built function to calculate the confusion matrix
conf_matrix(model_name = "Random Forest",
y_test = y_test_RF,
pred_test = TEST_pred_RF)
#Call the previously built function to calculate the classification report
class_report(y_test = y_test_RF,
y_train = y_train_RF,
pred_test = TEST_pred_RF,
pred_train = TRAIN_pred_RF,
model_name = "Random Forest")
#Call the previously built function to compare statistics on the train and test datasets
stat_comparison(y_test = y_test_RF,
y_train = y_train_RF,
X_test = X_test_RF,
X_train = X_train_RF,
pred_test = TEST_pred_RF,
pred_train = TRAIN_pred_RF,
model=RF)
#Call the previously built function to plot ROC curves
rcParams["figure.figsize"] = 18,5
plot_roc_cur(model = RF, X = X_test_RF, y = y_test_RF, df="test", color="green", model_name = "Random Forest")
plot_roc_cur(model = RF, X = X_train_RF, y = y_train_RF, df="train", color="red", model_name = "Random Forest")
#Probabilities
y_prob = RF.predict_proba(X_test_RF)
The graph shows what percentage of red and white wines the model captures in a given percentage of the scoring list. For example, in the top 10% of the scoring list, the model captures about 30% of the correctly classified wines of each type. The model captures red and white wines with similar effectiveness.
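The calculation behind a cumulative gains curve can be sketched in a few lines; `cumulative_gain` is a hypothetical helper and the data below is synthetic, not the model's actual output:

```python
import numpy as np

def cumulative_gain(y_true, scores, top_fraction):
    """Fraction of all positives captured in the top `top_fraction` of the scoring list."""
    order = np.argsort(scores)[::-1]                  # sort observations by score, descending
    n_top = int(len(scores) * top_fraction)
    captured = np.sum(np.asarray(y_true)[order][:n_top])
    return captured / np.sum(y_true)

# Synthetic example: 10 wines, 4 of them red (label 1)
y = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
s = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7, 0.4, 0.15, 0.85, 0.05]
print(cumulative_gain(y, s, 0.2))  # top 20% of the list already captures 2 of 4 reds -> 0.5
```

A perfectly random ranking would capture roughly `top_fraction` of the positives, which is why the diagonal is drawn as the baseline on the plot.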
#PROFIT curve
skplt.metrics.plot_cumulative_gain(y_test_RF,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='PROFIT curve - Cumulative Gains Curve for Random Forest Model')
plt.show()
The graph illustrates how many times more effective the model is at capturing the wine type than classifying wines without the model. For example, in the top 10% of the scoring list, the model correctly captures about twice as many red wines as a random selection would; the result for white wines is slightly worse.
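Lift is simply the cumulative gain divided by the fraction of the list taken; a minimal sketch with synthetic data (`lift_at` is a hypothetical helper, not part of scikit-plot):

```python
import numpy as np

def lift_at(y_true, scores, top_fraction):
    """Lift = (positives captured in the top fraction / all positives) / top_fraction."""
    order = np.argsort(scores)[::-1]                  # rank observations by score, descending
    n_top = int(len(scores) * top_fraction)
    captured = np.sum(np.asarray(y_true)[order][:n_top]) / np.sum(y_true)
    return captured / top_fraction

# Synthetic example: a perfect ranking of 10 wines with 5 reds
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
s = [0.99, 0.95, 0.9, 0.85, 0.8, 0.4, 0.3, 0.2, 0.1, 0.05]
print(lift_at(y, s, 0.2))  # top 20% holds 2 of 5 reds -> gain 0.4, lift 0.4 / 0.2 = 2.0
```

Taking the whole list always yields a lift of 1, which is the horizontal baseline on the LIFT plot.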
#LIFT curve
skplt.metrics.plot_lift_curve(y_test_RF,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='LIFT curve for Random Forest Model')
plt.show()
#Results on full dataset (probabilities of 0 and 1)
X_all = data_modelling.loc[:, data_modelling.columns != "wine_type"].values
y_all_prob = RF.predict_proba(X_all)
#Main DF and DF with probabilities
df_X_all = data_modelling
df_y_all_prob = pd.DataFrame(y_all_prob)
#Concatenation DF and probabilities
df_X_all_scored = pd.concat([df_X_all, df_y_all_prob * 100], axis = 1)
df_X_all_scored.rename(columns={0 : "Prob_0_white_wine", 1 : "Prob_1_red_wine"}, inplace=True)
df_X_all_scored
#Save columns with wine type and probabilities to Excel file
df_X_all_scored[["wine_type", "Prob_0_white_wine", "Prob_1_red_wine"]].to_excel("probabilities_results/prob_Random_Forest.xlsx")
#Save all dataset with probabilities to Excel file
df_X_all_scored.to_excel("probabilities_results/all_df_Random_Forest.xlsx")
#Lists of X and y column names from dataset
X_XGB = data_modelling.loc[:, data_modelling.columns != "wine_type"].columns.tolist()
y_XGB = data_modelling.loc[:, data_modelling.columns == "wine_type"].columns.tolist()
#Wrapper for cross validation with 5 folds - *args and **kwargs at the end - parameters passed as a list or dictionary
def CVTestXGB(nFolds = 5, randomState=2020, debug=False, *args, **kwargs):
    kf = KFold(n_splits=nFolds, shuffle=True, random_state=randomState)
    #Lists for results
    testResults = []
    trainResults = []
    predictions = []
    indices = []
    #Loop validating the model on successive folds
    for train_XGB, test_XGB in kf.split(data_modelling.index.values):
        #Preparation of estimator
        XGB = XGBClassifier(*args, **kwargs, random_state=randomState, n_jobs=-1, verbosity=0)
        if debug:
            print(XGB)
        #Model training
        XGB.fit(data_modelling.iloc[train_XGB][X_XGB], data_modelling.iloc[train_XGB][y_XGB])
        #Predictions for train and test datasets
        predictions_train = XGB.predict_proba(data_modelling.iloc[train_XGB][X_XGB])[:,1]
        predictions_test = XGB.predict_proba(data_modelling.iloc[test_XGB][X_XGB])[:,1]
        #Keep the prediction information for this fold
        predictions.append(predictions_test.tolist().copy())
        #Together with the indexes in the original data frame
        indices.append(data_modelling.iloc[test_XGB].index.tolist().copy())
        #Calculation of ROC-AUC
        trainScore = roc_auc_score((data_modelling[y_XGB].iloc[train_XGB]==1).astype(int), predictions_train)
        testScore = roc_auc_score((data_modelling[y_XGB].iloc[test_XGB]==1).astype(int), predictions_test)
        #Saving results to lists
        trainResults.append(trainScore)
        testResults.append(testScore)
        #Optionally display the training results for each fold
        if debug:
            print("Train AUC:", trainScore,
                  "Valid AUC:", testScore)
    return trainResults, testResults, predictions, indices
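The splitting mechanics inside this wrapper (what `KFold(shuffle=True)` does) can be illustrated with a pure-Python sketch; `kfold_indices` is a hypothetical helper written for illustration, not a library function:

```python
import random

def kfold_indices(n, n_splits, seed):
    """Minimal sketch of shuffled k-fold splitting: shuffle the row indices once,
    then let each fold in turn serve as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in kfold_indices(10, 5, 2020):
    assert len(test) == 2                           # each fold holds out n / n_splits rows
    assert sorted(train + test) == list(range(10))  # train and test partition the data
```

Every observation appears in exactly one test fold, so the out-of-fold predictions collected by the wrapper cover the whole dataset once.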
#Loop to find the best configuration of hyper parameters in cross validation with 5 folds
#Empty lists for hyper parameters from loop
eta_list = list()
max_depth_list = list()
subsample_list = list()
colsample_bytree_list = list()
colsample_bylevel_list = list()
gamma_list = list()
min_child_weight_list = list()
rate_drop_list = list()
skip_drop_list = list()
mean_TRAIN_list = list()
mean_TEST_list = list()
#Loop over hyperparameters - because of the extremely long execution time of tuning
#many hyperparameters in a loop, the ranges of the hyperparameters were reduced
print("eta || max_depth || subsample || colsample_bytree || colsample_bylevel || gamma || min_child_weight || rate_drop || skip_drop || mean_test || mean_train || train_test_difference")
print("=================================================================================================================================================================================")
for eta in [0.01, 0.02, 0.03, 0.5]:
    for max_depth in [5, 10, 20]:
        for subsample in [0.7, 1]:
            for colsample_bytree in [0.7, 1]:
                for colsample_bylevel in [0.7, 1]:
                    for gamma in [0, 5]:
                        for min_child_weight in [0, 1]:
                            for rate_drop in [0, 0.2]:
                                for skip_drop in [0, 0.5]:
                                    trainResults, testResults, predictions, indices = CVTestXGB(
                                        debug=False,
                                        eta=eta,
                                        max_depth=max_depth,
                                        subsample=subsample,
                                        colsample_bytree=colsample_bytree,
                                        colsample_bylevel=colsample_bylevel,
                                        gamma=gamma,
                                        min_child_weight=min_child_weight,
                                        rate_drop=rate_drop,
                                        skip_drop=skip_drop)
                                    #Append values from loop to lists
                                    eta_list.append(eta)
                                    max_depth_list.append(max_depth)
                                    subsample_list.append(subsample)
                                    colsample_bytree_list.append(colsample_bytree)
                                    colsample_bylevel_list.append(colsample_bylevel)
                                    gamma_list.append(gamma)
                                    min_child_weight_list.append(min_child_weight)
                                    rate_drop_list.append(rate_drop)
                                    skip_drop_list.append(skip_drop)
                                    mean_TRAIN_list.append(np.mean(trainResults))
                                    mean_TEST_list.append(np.mean(testResults))
                                    #Display mean results for the training and test sets from 5 folds
                                    #in each hyperparameter configuration
                                    print(eta, "||",
                                          max_depth, "||",
                                          subsample, "||",
                                          colsample_bytree, "||",
                                          colsample_bylevel, "||",
                                          gamma, "||",
                                          min_child_weight, "||",
                                          rate_drop, "||",
                                          skip_drop, "||",
                                          np.mean(testResults), "||",
                                          np.mean(trainResults), "||",
                                          (np.mean(trainResults) - np.mean(testResults)))
#Save results of hyperparameter tuning in a Data Frame
df = pd.DataFrame()
df["eta"] = eta_list
df["max_depth"] = max_depth_list
df["subsample"] = subsample_list
df["colsample_bytree"] = colsample_bytree_list
df["colsample_bylevel"] = colsample_bylevel_list
df["gamma"] = gamma_list
df["min_child_weight"] = min_child_weight_list
df["rate_drop"] = rate_drop_list
df["skip_drop"] = skip_drop_list
df["mean_TEST"] = mean_TEST_list
df["mean_TRAIN"] = mean_TRAIN_list
df["TRAIN_TEST_difference"] = df["mean_TRAIN"] - df["mean_TEST"]
As the Data Frame below shows, the first row presents the best combination of hyperparameters for the XGBoost model; this combination gives the best AUC on the test dataset with a similar result on the train dataset. Nevertheless, the scores are very high, which may indicate overfitting; with such a small input dataset, overfitting is likely.
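One way to operationalize that remark is to filter the tuning results by the train-test gap before picking the best test AUC. A sketch on hypothetical numbers (the notebook's real grid-search output lives in `df`):

```python
import pandas as pd

# Hypothetical tuning results in the same shape as the df built above
results = pd.DataFrame({
    "mean_TEST":  [0.950, 0.970, 0.960],
    "mean_TRAIN": [0.990, 0.999, 0.970],
})
results["TRAIN_TEST_difference"] = results["mean_TRAIN"] - results["mean_TEST"]

# Keep only configurations with a small train-test gap, then take the best test AUC
stable = results[results["TRAIN_TEST_difference"] < 0.03]
best = stable.sort_values("mean_TEST", ascending=False).iloc[0]
print(round(best["mean_TEST"], 3))
```

The 0.03 gap threshold is an arbitrary choice for illustration; in practice it would be set from the spread of gaps actually observed in the tuning results.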
#The best combination of hyper parameters in XGBoost model based on mean results on TEST dataset
df.sort_values(by="mean_TEST", ascending=False)
#X and y from dataset
X_XGB = data_modelling.loc[:, data_modelling.columns != "wine_type"]
y_XGB = data_modelling.loc[:, data_modelling.columns == "wine_type"]
#Loop to find optimal train / test split
k_list = list()
test_AUC_list = list()
train_AUC_list = list()
for k in range(1, 10):
    X_train_XGB, X_test_XGB, y_train_XGB, y_test_XGB = train_test_split(X_XGB,
                                                                        y_XGB,
                                                                        test_size = 0.1*k,
                                                                        random_state = 1010)
    #XGBoost model with hyperparameters after tuning
    XGB = xgb.sklearn.XGBClassifier(eta = 0.5,
                                    max_depth=5,
                                    subsample=0.7,
                                    colsample_bytree=0.7,
                                    colsample_bylevel=0.1,
                                    gamma=0,
                                    min_child_weight=1,
                                    rate_drop=0,
                                    skip_drop=0.5,
                                    verbosity=0)
    XGB.fit(X = X_train_XGB, y = y_train_XGB)
    #Prediction on train dataset
    prediction_train_XGB = XGB.predict(X_train_XGB)
    #Prediction on test dataset
    prediction_test_XGB = XGB.predict(X_test_XGB)
    k_list.append(k)
    test_AUC_list.append(round(roc_auc_score(y_test_XGB, prediction_test_XGB), 3))
    train_AUC_list.append(round(roc_auc_score(y_train_XGB, prediction_train_XGB), 3))
df = pd.DataFrame()
df["test_size"] = [x/10 for x in k_list]
df["train_size"] = 1 - df["test_size"]
df["test_AUC"] = test_AUC_list
df["train_AUC"] = train_AUC_list
The configuration test = 0.3 / train = 0.7 gives the highest AUC on the test dataset; moreover, in this configuration the AUC results on the train and test datasets are similar.
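Picking the row behind this statement can also be done programmatically; a sketch on hypothetical AUC values mirroring the data frame built above:

```python
import pandas as pd

# Hypothetical split-search results in the same shape as the df above
splits = pd.DataFrame({
    "test_size": [0.1, 0.2, 0.3, 0.4],
    "test_AUC":  [0.970, 0.972, 0.981, 0.975],
    "train_AUC": [0.999, 0.995, 0.985, 0.983],
})

# Row with the highest test AUC, plus its train-test gap as an overfitting check
best = splits.loc[splits["test_AUC"].idxmax()]
print(best["test_size"], round(best["train_AUC"] - best["test_AUC"], 3))
```

Reading off both the best test AUC and its train-test gap avoids picking a split that wins on the test set only by overfitting less data.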
#Data Frame with train / test split results
df
#X and y from dataset
X_XGB = data_modelling.loc[:, data_modelling.columns != "wine_type"]
y_XGB = data_modelling.loc[:, data_modelling.columns == "wine_type"]
#Split dataset to train and test
X_train_XGB, X_test_XGB, y_train_XGB, y_test_XGB = train_test_split(X_XGB,
y_XGB,
train_size = 0.7,
test_size = 0.3,
random_state = 12121)
#Build and train the XGBoost model with the tuned hyperparameters and the best train / test split
XGB = xgb.sklearn.XGBClassifier(eta = 0.5,
max_depth=5,
subsample=0.7,
colsample_bytree=0.7,
colsample_bylevel=0.1,
gamma=0,
min_child_weight=1,
rate_drop=0,
skip_drop=0.5,
verbosity=0)
XGB = XGB.fit(X = X_train_XGB, y = y_train_XGB)
#Predictions on train and test datasets
TRAIN_pred_XGB = XGB.predict(X_train_XGB)
TEST_pred_XGB = XGB.predict(X_test_XGB)
#Feature importance in the XGBoost model
rcParams["figure.figsize"] = 18,5
plot_importance(XGB, height=0.7, color="red", title="Feature importance in XGBoost")
pyplot.show()
#Call the previously built function to calculate the confusion matrix
conf_matrix(model_name = "XGBoost",
y_test = y_test_XGB,
pred_test = TEST_pred_XGB)
#Call the previously built function to calculate the classification report
class_report(y_test = y_test_XGB,
y_train = y_train_XGB,
pred_test = TEST_pred_XGB,
pred_train = TRAIN_pred_XGB,
model_name = "XGBoost")
#Call the previously built function to compare statistics on the train and test datasets
stat_comparison(y_test = y_test_XGB,
y_train = y_train_XGB,
X_test = X_test_XGB,
X_train = X_train_XGB,
pred_test = TEST_pred_XGB,
pred_train = TRAIN_pred_XGB,
model = XGB)
#Call the previously built function to plot ROC curves
rcParams["figure.figsize"] = 18,5
plot_roc_cur(model = XGB, X = X_test_XGB, y = y_test_XGB, df="test", color="black", model_name = "XGBoost")
plot_roc_cur(model = XGB, X = X_train_XGB, y = y_train_XGB, df="train", color="orange", model_name = "XGBoost")
#Probabilities
y_prob = XGB.predict_proba(X_test_XGB)
The graph shows what percentage of red and white wines the model captures in a given percentage of the scoring list. For example, in the top 10% of the scoring list, the model captures about 30% of the correctly classified wines of each type. The model captures red and white wines with similar effectiveness.
#PROFIT curve
skplt.metrics.plot_cumulative_gain(y_test_XGB,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='PROFIT curve - Cumulative Gains Curve for XGBoost Model')
plt.show()
The graph illustrates how many times more effective the model is at capturing the wine type than classifying wines without the model. For example, in the top 10% of the scoring list, the model correctly captures about twice as many red wines as a random selection would; the result for white wines is slightly worse.
#LIFT curve
skplt.metrics.plot_lift_curve(y_test_XGB,
y_prob,
figsize=(15,5),
title_fontsize=16,
text_fontsize=10,
title='LIFT curve for XGBoost Model')
plt.show()
#Results on full dataset (probabilities of 0 and 1)
X_all = data_modelling.loc[:, data_modelling.columns != "wine_type"].values
y_all_prob = XGB.predict_proba(X_all)
#Main DF and DF with probabilities
df_X_all = data_modelling
df_y_all_prob = pd.DataFrame(y_all_prob)
#Concatenation DF and probabilities
df_X_all_scored = pd.concat([df_X_all, df_y_all_prob * 100], axis = 1)
df_X_all_scored.rename(columns={0 : "Prob_0_white_wine", 1 : "Prob_1_red_wine"}, inplace=True)
df_X_all_scored
#Save columns with wine type and probabilities to Excel file
df_X_all_scored[["wine_type", "Prob_0_white_wine", "Prob_1_red_wine"]].to_excel("probabilities_results/prob_XGBoost.xlsx")
#Save all dataset with probabilities to Excel file
df_X_all_scored.to_excel("probabilities_results/all_df_XGBoost.xlsx")
#Results on test datasets
#Logistic Regression
accuracy_LR = accuracy_score(y_test_LR, TEST_pred_LR)
recall_LR = recall_score(y_test_LR, TEST_pred_LR)
precision_LR = precision_score(y_test_LR, TEST_pred_LR)
f1_LR = f1_score(y_test_LR, TEST_pred_LR)
AUC_LR = metrics.roc_auc_score(y_test_LR, LR.predict_proba(X_test_LR)[:,1])
Gini_LR = (2*AUC_LR) - 1
#KNN
accuracy_KNN = accuracy_score(y_test_KNN, TEST_pred_KNN)
recall_KNN = recall_score(y_test_KNN, TEST_pred_KNN)
precision_KNN = precision_score(y_test_KNN, TEST_pred_KNN)
f1_KNN = f1_score(y_test_KNN, TEST_pred_KNN)
AUC_KNN = metrics.roc_auc_score(y_test_KNN, KNN.predict_proba(X_test_KNN)[:,1])
Gini_KNN = (2*AUC_KNN) - 1
#SVM
accuracy_SVM = accuracy_score(y_test_SVM, TEST_pred_SVM)
recall_SVM = recall_score(y_test_SVM, TEST_pred_SVM)
precision_SVM = precision_score(y_test_SVM, TEST_pred_SVM)
f1_SVM = f1_score(y_test_SVM, TEST_pred_SVM)
AUC_SVM = metrics.roc_auc_score(y_test_SVM, SVM.predict_proba(X_test_SVM)[:,1])
Gini_SVM = (2*AUC_SVM) - 1
#Naive Bayes
accuracy_NB = accuracy_score(y_test_NB, TEST_pred_NB)
recall_NB = recall_score(y_test_NB, TEST_pred_NB)
precision_NB = precision_score(y_test_NB, TEST_pred_NB)
f1_NB = f1_score(y_test_NB, TEST_pred_NB)
AUC_NB = metrics.roc_auc_score(y_test_NB, NB.predict_proba(X_test_NB)[:,1])
Gini_NB = (2*AUC_NB) - 1
#Decision Tree
accuracy_DT = accuracy_score(y_test_DT, TEST_pred_DT)
recall_DT = recall_score(y_test_DT, TEST_pred_DT)
precision_DT = precision_score(y_test_DT, TEST_pred_DT)
f1_DT = f1_score(y_test_DT, TEST_pred_DT)
AUC_DT = metrics.roc_auc_score(y_test_DT, DT.predict_proba(X_test_DT)[:,1])
Gini_DT = (2*AUC_DT) - 1
#Random Forest
accuracy_RF = accuracy_score(y_test_RF, TEST_pred_RF)
recall_RF = recall_score(y_test_RF, TEST_pred_RF)
precision_RF = precision_score(y_test_RF, TEST_pred_RF)
f1_RF = f1_score(y_test_RF, TEST_pred_RF)
AUC_RF = metrics.roc_auc_score(y_test_RF, RF.predict_proba(X_test_RF)[:,1])
Gini_RF = (2*AUC_RF) - 1
#XGBoost
accuracy_XGB = accuracy_score(y_test_XGB, TEST_pred_XGB)
recall_XGB = recall_score(y_test_XGB, TEST_pred_XGB)
precision_XGB = precision_score(y_test_XGB, TEST_pred_XGB)
f1_XGB = f1_score(y_test_XGB, TEST_pred_XGB)
AUC_XGB = metrics.roc_auc_score(y_test_XGB, XGB.predict_proba(X_test_XGB)[:,1])
Gini_XGB = (2*AUC_XGB) - 1
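The Gini coefficient computed for each model above is just a rescaling of AUC onto the [-1, 1] range; a quick sanity check:

```python
def gini_from_auc(auc):
    """Gini = 2*AUC - 1: a random model (AUC 0.5) scores 0, a perfect model (AUC 1.0) scores 1."""
    return 2 * auc - 1

print(gini_from_auc(0.5), gini_from_auc(0.75), gini_from_auc(1.0))  # 0.0 0.5 1.0
```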
#Data Frame with model statistics on the test dataset to compare the models
statistics_comparision_df = pd.DataFrame()
statistics_comparision_df["MODEL"] = ["Logistic Regression", "KNN", "SVM","Naive Bayes",
"Decision Tree", "Random Forest", "XGBoost"]
statistics_comparision_df["Accuracy"] = [accuracy_LR, accuracy_KNN, accuracy_SVM, accuracy_NB,
accuracy_DT, accuracy_RF, accuracy_XGB]
statistics_comparision_df["Precision"] = [precision_LR, precision_KNN, precision_SVM, precision_NB,
precision_DT, precision_RF, precision_XGB]
statistics_comparision_df["Recall"] = [recall_LR, recall_KNN, recall_SVM, recall_NB,
recall_DT, recall_RF, recall_XGB]
statistics_comparision_df["F1"] = [f1_LR, f1_KNN, f1_SVM, f1_NB,
f1_DT, f1_RF, f1_XGB]
statistics_comparision_df["AUC"] = [AUC_LR, AUC_KNN, AUC_SVM, AUC_NB,
AUC_DT, AUC_RF, AUC_XGB]
statistics_comparision_df["Gini"] = [Gini_LR, Gini_KNN, Gini_SVM, Gini_NB,
Gini_DT, Gini_RF, Gini_XGB]
#Add index and sort by Accuracy
statistics_comparision_df.set_index("MODEL", inplace=True)
statistics_comparision_df.sort_values(by="Accuracy", ascending=False, inplace=True)
statistics_comparision_df.to_excel("models_comparision/models_comparision.xlsx")
statistics_comparision_df
#Comparison of models based on ROC curves on the test datasets
#Logistic Regression
y_pred_prob1 = LR.predict_proba(X_test_LR)[:,1]
auc1 = metrics.roc_auc_score(y_test_LR, y_pred_prob1)
fpr1 , tpr1, thresholds1 = roc_curve(y_test_LR, y_pred_prob1)
#KNN
y_pred_prob2 = KNN.predict_proba(X_test_KNN)[:,1]
auc2 = metrics.roc_auc_score(y_test_KNN, y_pred_prob2)
fpr2 , tpr2, thresholds2 = roc_curve(y_test_KNN, y_pred_prob2)
#SVM
y_pred_prob3 = SVM.predict_proba(X_test_SVM)[:,1]
auc3 = metrics.roc_auc_score(y_test_SVM, y_pred_prob3)
fpr3 , tpr3, thresholds3 = roc_curve(y_test_SVM, y_pred_prob3)
#Naive Bayes
y_pred_prob4 = NB.predict_proba(X_test_NB)[:,1]
auc4 = metrics.roc_auc_score(y_test_NB, y_pred_prob4)
fpr4 , tpr4, thresholds4 = roc_curve(y_test_NB, y_pred_prob4)
#Decision Tree
y_pred_prob5 = DT.predict_proba(X_test_DT)[:,1]
auc5 = metrics.roc_auc_score(y_test_DT, y_pred_prob5)
fpr5 , tpr5, thresholds5 = roc_curve(y_test_DT, y_pred_prob5)
#Random Forest
y_pred_prob6 = RF.predict_proba(X_test_RF)[:,1]
auc6 = metrics.roc_auc_score(y_test_RF, y_pred_prob6)
fpr6 , tpr6, thresholds6 = roc_curve(y_test_RF, y_pred_prob6)
#XGBoost
y_pred_prob7 = XGB.predict_proba(X_test_XGB)[:,1]
auc7 = metrics.roc_auc_score(y_test_XGB, y_pred_prob7)
fpr7 , tpr7, thresholds7 = roc_curve(y_test_XGB, y_pred_prob7)
#Plot
rcParams["figure.figsize"] = 18,7
plt.plot([0,1],[0,1], 'k--', color="black")
plt.plot(fpr1, tpr1, label= "Logistic Regression" + " (AUC = %0.3f)" % auc1)
plt.plot(fpr2, tpr2, label= "KNN" + " (AUC = %0.3f)" % auc2)
plt.plot(fpr3, tpr3, label= "SVM" + " (AUC = %0.3f)" % auc3)
plt.plot(fpr4, tpr4, label= "Naive Bayes" + " (AUC = %0.3f)" % auc4)
plt.plot(fpr5, tpr5, label= "Decision Tree" + " (AUC = %0.3f)" % auc5)
plt.plot(fpr6, tpr6, label= "Random Forest" + " (AUC = %0.3f)" % auc6)
plt.plot(fpr7, tpr7, label= "XGBoost" + " (AUC = %0.3f)" % auc7)
#Axis
plt.xlabel("False Positive Rate", fontsize=15)
plt.ylabel("True Positive Rate", fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Receiver Operating Characteristic (ROC) - all models on test datasets', fontsize=20)
#Legend
plt.legend(loc="best",
prop={"size": 15},
title="AUC of Models",
title_fontsize="15",
frameon=True)
#Save fig
plt.savefig("models_comparision/ROC_all_models.png", bbox_inches="tight")
plt.show()
#Total execution time of the full script
end = datetime.datetime.now()
print("Execution time of the full script:", end - start)
General summary
The main goal of this project was to build and evaluate models predicting the class of wine (red / white). The best model was then chosen based on the statistics of each classification model as well as a visualization of the ROC curves of all models on one plot with their AUC scores.
The modelling dataset is really small: it has only 7922 observations and 13 variables, including the target variable (wine_type). As a result, the models may overfit; regardless of the chosen algorithms, hyperparameter tuning or data engineering techniques, a sufficiently large dataset of good quality matters more for the models than the algorithms themselves.
EDA (Exploratory Data Analysis) summary
The input dataset, before any data engineering, contained 6497 observations and 13 variables. The input dataset after concatenation was presented in two different reports: Pandas Profiling and Sweetviz.
Then many data modification processes were applied: renaming columns, enumerating new variables, checking and changing data types, removing duplicates, handling missing values, outlier detection by boxplots, Isolation Forest and the Hampel method, checking the balance of the target variable, and analysis of the distribution of the variables in 5 ways: histograms, the Kolmogorov-Smirnov test, the Shapiro-Wilk test, the normal test from the Scipy library, and kurtosis and skew.
Then the data was visualized with scatter plots.
Modelling summary
Before modelling, dummy coding was carried out, followed by variable selection using: correlation analysis (Pearson / Spearman), VIF, IV, Forward / Backward selection, TREE and RFE. Then oversampling was performed with the SMOTE method to balance the dataset. The last step before modelling was the creation of helper functions to quickly and easily compute: the confusion matrix, the classification report, the ROC curve, and a comparison of model statistics on the train and test datasets to detect possible overfitting. Because of the small dataset, the variables were ultimately selected only by the CORR and VIF methods.
In total, 7 models were built (including ensembling techniques such as Random Forest and boosting): Logistic Regression, KNN, SVM, Naive Bayes, Decision Tree, Random Forest and XGBoost. Each model was built after tuning both the classifier hyperparameters and the train / test split, so as to find the best parameters of the classifier as well as the best train / test split configuration.
In Logistic Regression, hyperparameter tuning was performed with GridSearchCV; in the remaining models (KNN, SVM, Naive Bayes, Decision Tree, Random Forest and XGBoost), tuning was performed with a loop that created different combinations of all chosen classifier parameters, so as to achieve the best AUC on the test dataset together with similar AUC results on the train and test datasets. Tuning of the train / test split was performed with a loop in each model.
Each model was evaluated by: confusion matrix, classification report, ROC curve, AUC, Accuracy, Precision, Recall, F1 and Gini. Moreover, for each model the results on the train and test datasets were compared to detect possible overfitting, and 2 plots with an easy business interpretation were drawn: PROFIT and LIFT.
Finally, the statistics of the models were compared in one Data Frame and on one ROC plot. Based on this, Random Forest presents the highest Accuracy and, together with XGBoost, the highest Precision; nevertheless, although XGBoost has a slightly worse Accuracy than Random Forest, it has a significantly higher Recall, so the best of the built models is XGBoost.
Of course, as mentioned at the beginning of this conclusion, both the input and modelling datasets were really small and the models may overfit; regardless of the chosen algorithms, hyperparameter tuning or data engineering techniques, a sufficiently large dataset is the most important factor. A solution could be data enrichment: either buying more data, for instance from a data bank, or performing additional analyses or surveys to collect more data.